Maximizing Throughput of Multi-user Parallel Data Processing Systems

ABSTRACT

The invention provides systems and methods for maximizing revenue generating throughput of a multi-user parallel data processing platform across a set of users of the service provided with the platform. The invented techniques, for any given user contract among the contracts supported by the platform, and on any given billing assessment period, determine a level of a demand for the capacity of the platform associated with the given contract that is met by a level of access to the capacity of the platform allocated to the given contract, and assess billables for the given contract at least in part based on such met demand and a level of assured access to the capacity of the platform associated with the given contract, as well as billing rates, applicable for the given billing assessment period, for the met demand and the level of assured access associated with the given contract.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of the following provisional application, which is incorporated by reference in its entirety:

-   [1] U.S. Provisional Application No. 61/556,065, filed Nov. 4, 2011.

This application is also related to the following, each of which is incorporated by reference in its entirety:

-   [2] U.S. Utility application Ser. No. 13/277,739, filed Oct. 20, 2011;
-   [3] U.S. Utility application Ser. No. 13/270,194, filed Oct. 10, 2011;
-   [4] U.S. Provisional Application No. 61/539,616, filed Sep. 27, 2011;
-   [5] U.S. Utility application Ser. No. 13/184,028, filed Jul. 15, 2011; and
-   [6] U.S. Provisional Application No. 61/476,268, filed Apr. 16, 2011.

BACKGROUND

1. Technical Field

This invention pertains to the field of digital data processing, particularly to the field of techniques for maximizing data processing throughput per unit cost across a set of software programs dynamically sharing a data processing system comprising multiple processing cores.

2. Description of the Related Art

Computing systems will increasingly be based on large arrays of processing cores, particularly in higher capacity server type computers. The multi-core computing hardware will often be shared by a number of software applications, some of which may belong to different users, while individual software applications will also increasingly be executing on multiple processing cores in parallel. As a result, the set of application program processing tasks running on the set of cores of a given multi-core based computer will need to be updated, potentially highly frequently, in order to pursue sufficiently high application program level as well as system wide processing throughput. To cost-efficiently enable such dynamic application task switching on a parallel computing platform, novel multi-user parallel computing architectures are needed to support efficiently transferring the processing context of any given task to any core of the system, as well as to facilitate efficient communication among the tasks of any given application program running on the multi-core data processing system. Moreover, innovations are needed regarding effective pricing and billing of user contracts, to increase the parallel computing cost-efficiency both for the users and the provider of the computing service. Particular challenges to be solved include providing an effective compute capacity service unit pricing model and billing techniques with appropriate incentives and tools to optimally spread users' data processing loads in time and space across the available parallel data processing resources, in order to pursue maximization of data processing throughput per unit cost for the users as well as maximization of profits for the service provider.

SUMMARY

The invention provides systems and methods for maximizing the revenue generating data processing throughput of a multi-user parallel processing platform across a set of users of the computing capacity service provided with the platform. More specifically, the invention involves billing techniques, which, on any given billing assessment period among a series of successive billing assessment periods, and for any given user contract among a set of user contracts supported by the given platform: 1) observe a level of a demand for a capacity of the platform associated with the given contract that is met by a level of access to the capacity of the platform allocated to the given contract, and 2) assess billables for the given contract at least in part based on i) its level of assured access to the capacity of the platform and ii) the observed met portion of its demand for the platform capacity.

Various embodiments of such billing techniques further include various combinations of additional steps and features, such as features whereby: a) the assessing is done furthermore based at least in part on billing rates for units of the level of assured access associated with the given contract and/or units of the met demand associated with the given contract, with said billing rates being set to different values on different billing assessment periods, in order to increase the collective billables associated with the user contracts supported by a given platform, b) the capacity of the platform is periodically, once for each successive capacity allocation period, re-allocated among the user contracts at least in part based on: i) the level of assured access to the capacity of the platform associated with the given contract and ii) the demand for the capacity of the platform by the given contract, and c) at least one of said steps of observing, reallocating and assessing is done by digital hardware logic operating without software involvement on at least some of the billing assessment periods.

According to certain embodiments of the invention, the billing techniques operate based on time periods for which units of the processing capacity, e.g. CPU cores of a multi-core array, are periodically reallocated; such time periods, referred to as capacity or core allocation periods (CAPs), i.e., time periods during which the capacity allocation at the platform, as well as the billing rates, remain constant, are configured to consist of a specified number of processing clock cycles. In other embodiments, the invented billing techniques operate based on time periods during which the billing rates remain constant but during which the capacity of the platform can be reallocated; in such embodiments, the concepts of demand based core allocations (DBCAs) and core entitlements (CEs) for a given program, for billing purposes, refer to the average DBCA and CE levels, respectively, over the time periods for which the related billing rates remained constant. Collectively, these time periods based on which the invented billing techniques operate are referred to as billing assessment periods (BAPs).
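
To make the CAP/BAP relationship concrete, the following minimal Python sketch (all names hypothetical; actual embodiments use hardware logic) averages per-CAP core allocations and entitlements into the per-BAP DBCA and CE quantities used for billing:

```python
# Illustrative sketch: averaging per-CAP core allocations into per-BAP
# billing quantities, per the BAP/CAP relationship described above.
# All names are hypothetical; the patent contemplates hardware logic.

def bap_billing_quantities(cap_allocations, cap_entitlements):
    """Average the demand based core allocations (DBCA) and core
    entitlements (CE) over the CAPs of one billing assessment period
    (BAP), during which the billing rates remain constant."""
    n = len(cap_allocations)
    avg_dbca = sum(cap_allocations) / n
    avg_ce = sum(cap_entitlements) / n
    return avg_dbca, avg_ce

# Example: a BAP spanning four CAPs for one user contract.
dbca, ce = bap_billing_quantities([3, 5, 4, 4], [4, 4, 4, 4])
print(dbca, ce)  # 4.0 4.0
```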

An aspect of the invention provides a system for improving data processing service throughput per unit cost of the service through billing adjustment techniques, with said system comprising digital logic for: 1) allocating an array of processing cores for processing software programs of a set of users of the service, and 2) assessing billables for the service for each given user of the service on successive BAPs based at least in part on quantities of cores among said array that any given user i) has a contractual entitlement for being allocated on each CAP of any given BAP if so demanded, with such a quantity referred to as a Core Entitlement (CE), and ii) got allocated to meet its expressed demands for cores on the CAPs of the given BAP, with such a quantity referred to as a Demand Based Core Allocation (DBCA). Various embodiments of such systems further include various combinations of additional features, such as features whereby a) the digital logic for assessing the billables for a given user for the service involves logic that multiplies the user's CE established for the given BAP with a CE billing rate applicable for that BAP, b) the digital logic for assessing the billables for a given user for the service involves logic that multiplies the user's DBCA determined for the given BAP with a DBCA billing rate applicable for that BAP, and c) the assessing is done furthermore based on billing rates for CEs and/or DBCAs, with at least one of the CE or DBCA billing rates being varied between the successive BAPs, to optimally spread the users' data processing loads for the dynamically allocated array of cores over time, thus maximizing the users' data processing throughput per unit cost as well as the service provider's billables from the user contracts.
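
The per-BAP billables assessment described above can be sketched as follows; the rates and quantities are invented example values, not figures from the specification:

```python
# Hypothetical sketch of the billables assessment described above:
# per BAP, billables = CE x CE-rate + DBCA x DBCA-rate.

def assess_billables(ce, dbca, ce_rate, dbca_rate):
    """Billables for one user contract on one BAP."""
    return ce * ce_rate + dbca * dbca_rate

# Rates may differ between BAPs, e.g. a peak-hour BAP vs. an off-peak BAP.
print(assess_billables(ce=4, dbca=3.5, ce_rate=1.00, dbca_rate=0.25))  # 4.875
```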

Another aspect of the invention provides a method for improving data processing service throughput per unit cost of the service through billing adjustment techniques, with such a method comprising 1) repeatedly, once for each CAP, allocating an array of processing cores for processing software programs of a set of users of the service, and 2) adjusting billables for the service for each given user of the service on successive BAPs based at least in part on quantities of cores among said array that any given user i) has an entitlement for being allocated on each CAP of a given BAP and ii) got allocated to meet its demands for cores on the CAPs of the given BAP. Various embodiments of such methods further include various combinations of further steps and features, such as those whereby a) the allocating is done at least in part based on entitlements and/or demands for cores among said array by one or more of the software programs of the set of users, and b) the adjusting is furthermore done based at least in part on a value of respective billing rates, applicable for a given BAP, for cores among said array that the given user's software program i) has an entitlement for on the given BAP and ii) got allocated to meet its demands for cores on the CAPs of the given BAP.

A further aspect of the invention provides a system for improving the revenue generation capability of a data processing platform for the operator of the platform providing computing capacity services for users dynamically sharing the platform, with the platform having a certain cost to its operator and a certain pool of processing resources for executing the users' software programs. Such a system comprises digital logic for: 1) allocating the pool of resources of the platform for processing the user programs at least in part based on the users' respective entitlements for the pool of resources, 2) adjusting a billing rate for the users' entitlements, for individual BAPs, at least in part based on a relative popularity of the entitlements on the individual BAPs among successive BAPs, and 3) determining billables associated with each of the user programs, based at least in part on the adjusting of the billing rate for the entitlements. Various embodiments of such systems further include various combinations of additional features, such as features by which a) the pool of resources comprises an array of processing cores that are periodically allocated among the user software programs, b) the digital logic for allocating performs its allocating of the pool of resources furthermore at least in part based on demands for the resources among said pool by the user programs, c) the system further comprises digital logic for adjusting, for successive BAPs, a billing rate for resources among the pool allocated to a user program to meet a demand expressed by the user program for such resources, with the determining being based furthermore at least in part on the adjusting of the billing rate for such resources allocated based on demand, and d) the digital logic subsystems for allocating, adjusting and determining comprise hardware logic that, on at least some BAPs among the successive BAPs, operates automatically without software involvement.

Yet another aspect of the invention provides a method for improving a revenue generation capability of a data processing platform that has a certain pool of processing resources and that is dynamically shared among a set of user software programs. Such a method comprises 1) once for each new capacity allocation period, allocating the pool of resources for processing user programs at the platform at least in part based on respective entitlements for the pool of resources by the user programs, 2) adjusting a billing rate for said entitlements, for successive BAPs, at least in part based on a relative popularity of the entitlements on individual BAPs among the successive BAPs, and 3) determining, based at least in part on said adjusting, billables associated with the user programs. Various embodiments of such methods further include various combinations of further steps and features, such as those whereby a) the determining further involves, for a given user program, and for individual BAPs among the successive BAPs, multiplying the entitlements for the pool of resources by the given user program on a given BAP with the billing rate for the entitlements on the given BAP, b) the allocating, on a given capacity allocation period, furthermore is based at least in part on respective demands for the resources among the pool by the user programs for the given capacity allocation period, c) there further is a step of adjusting, for successive BAPs, a billing rate for resources among the pool allocated to a user program to meet a demand by the user program for such demand based resource allocations, with the determining being furthermore based at least in part on the adjusting of the billing rate for the demand based resource allocations, and d) the determining furthermore is based at least in part on i) resolving a level of the resources among the pool allocated to a user program to meet a demand by the user program for such resources on a given BAP and/or ii) applying a billing rate applicable for the resources resolved in i) on the given BAP.
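
As one possible reading of the popularity-based rate adjustment, the sketch below scales a base CE billing rate by each BAP's share of aggregate entitlement demand; the proportional scaling rule is our own illustrative assumption, not a formula given in the specification:

```python
# A rough sketch, under assumptions of our own, of adjusting the CE billing
# rate per BAP based on the relative popularity of entitlements: BAPs whose
# aggregate contracted entitlements exceed the average are priced at a
# premium, others at a discount. The scaling rule is illustrative only.

def adjust_ce_rates(base_rate, entitlements_per_bap):
    avg = sum(entitlements_per_bap) / len(entitlements_per_bap)
    # Scale each BAP's rate by its share of demand relative to the average.
    return [base_rate * (e / avg) for e in entitlements_per_bap]

print(adjust_ce_rates(1.0, [8, 16, 4, 4]))  # [1.0, 2.0, 0.5, 0.5]
```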

In embodiments of the invention, either or both of a user's capacity entitlement and demand based capacity allocation billing rates, as discussed above, can be set to different values on different hours of the day, days of the week, seasons, special calendar dates, etc., in order to optimally spread the collective processing load of the user programs for a given platform over time, and thereby increase the cost efficiency for the users of the computing service provided with the given platform and/or the revenue generation capability for the service provider operating the platform. For example, popular time periods for computing capacity services, which with flat billing rates would experience the highest demands for the platform processing capacity, can according to embodiments of the invention be configured with premium billing rates, in order to incentivize the users to shift the execution of their non-time-critical programs and tasks (e.g. asynchronous, background or overnight batch processes) for execution on the otherwise less popular, discounted billing rate, time periods. Moreover, the capability per embodiments of the invention to facilitate optimally combining user contracts with complementary core entitlement (CE) time profiles on a given dynamically shared computing platform allows the service provider operating the platform to support a given set of user contracts with reduced total platform core capacity, i.e. at a reduced cost base, and thereby increase the competitiveness of the compute capacity service offering among prospective customers in terms of price-performance. Examples of such user applications with mutually complementary, i.e. minimally overlapping, CE time profile peaks, which could be efficiently combined for the set of user programs to dynamically share a given platform per the invention, are realtime enterprise software applications (demanding peak performance during business hours), consumer media and entertainment applications (demanding peak performance in evening hours and during weekends) and overnight batch jobs (demanding peak capacity before the business hours of the day). Note also that a further advantage of embodiments of the invented billing techniques is that, because a portion of the cost of the utility computing service for a user running its program on the platform is based on the (met) levels of core demands expressed by the user's program, the users of the compute capacity service provided with a computing platform utilizing the invented billing techniques have an economic incentive to configure their programs so that they eliminate core demands beyond the number of cores that the given program is actually able to effectively utilize at the given time. Because the user applications thus do not automatically demand at least their CE worth of cores irrespective of how many cores the given program is able to execute on in parallel at any given time, the average amount of surplus cores for runs of the core allocation algorithm, i.e., cores that can be allocated in a fully demand driven manner (rather than in a manner to just meet the core demands by each application for their CE worth of cores), is increased, compared to a case where the users would not have the incentive to economize with their core demands.
Such maximally demand driven core allocation (which nevertheless allows guaranteeing each user application an assured deterministic minimum system capacity access level, whenever actually demanded) facilitates providing maximized user program data processing throughput per unit cost across the set of user applications dynamically sharing a given computing platform per the invention. Consequently, this maximization of data processing throughput per unit cost by the invention also drives the maximization of profitability for the computing capacity service provider operating such a given platform per the invention.
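
A hypothetical hour-of-day rate schedule of the kind described above might look like the following; the specific hours and rates are invented for illustration:

```python
# Illustrative sketch of a billing-rate schedule varying by hour of day,
# as described above; the hours and rates here are invented examples.

def dbca_rate_for_hour(hour):
    """Premium rates during business hours, discounted overnight, to steer
    non-time-critical (e.g. batch) loads onto less popular periods."""
    if 9 <= hour < 17:      # business hours: peak demand
        return 0.40
    if 17 <= hour < 23:     # evening: consumer media peak
        return 0.30
    return 0.10             # overnight: discounted batch window

print([dbca_rate_for_hour(h) for h in (3, 10, 20)])  # [0.1, 0.4, 0.3]
```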

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows, in accordance with an embodiment of the invention, a functional block diagram for an application program load adaptive parallel data processing system, comprising a multi-core fabric, member cores of which are dynamically space and time shared among processing tasks of a set of software programs, with the sharing of the fabric of cores among the software programs controlled by a system controller module, and with switching of program tasks to their execution cores and inter-task communications handled through an efficient fabric network. In the general context of systems per FIG. 1, FIGS. 2-3 illustrate the operation and internal structure of the system controller, and FIGS. 4-7 those of the fabric network. As shown in FIG. 3, the controller also involves a billing subsystem, the internal modules and operation of which are illustrated in FIG. 8, according to an embodiment of the invention.

FIG. 2 provides a context diagram for a process, implemented by the controller of a system per FIG. 1, to select and map active tasks of application programs configured to run on the system to their target processing cores, in accordance with an aspect of the invention.

FIG. 3 illustrates, in accordance with an aspect of the invention, a flow diagram and major steps for the system controller process per FIG. 2, as well as the operating context for the contract billing subsystem illustrated in FIG. 8.

FIG. 4 illustrates, in accordance with an embodiment of the invention, a network and memory architecture for the multi-core fabric of a system per FIG. 1.

FIG. 5 shows, at a more detailed level, a portion of an embodiment of a logic system per FIG. 4 concerning functions of backing up updated task memory images from the cores of the fabric to the task specific segments in the fabric memories, as well as the writing of inter-task communication information by tasks of application programs running on the system to one another's such memory segments.

FIG. 6 shows, at a more detailed level, an embodiment of a portion of a logic system per FIG. 4 concerning functions of retrieving updated task memory images from the task specific segments in the fabric memories to their next processing cores within the fabric, as well as the reading of inter-task communication information by tasks of applications running on the system from their segments in such memories.

FIG. 7 presents, at a further level of detail, an embodiment of logic functionality for the subsystem per FIG. 5, concerning a capability for tasks of an application program to write information to each other's, including their own, memory segments within the multi-core fabric.

FIG. 8 illustrates, in accordance with an aspect of the invention, a billing subsystem for a multi-user parallel processing platform of FIG. 1.

DETAILED DESCRIPTION

The invention is described herein in further detail by illustrating the novel concepts in reference to the drawings. General symbols and notations used in the drawings:

-   Boxes indicate a functional digital logic module; unless otherwise specified for a particular embodiment, such modules may comprise both software and hardware logic functionality.
-   Arrows indicate a digital signal flow. A signal flow may comprise one or more parallel bit wires. The direction of an arrow indicates the direction of primary flow of information associated with it with regards to the discussion of the system functionality herein, but does not preclude information flow also in the opposite direction.
-   A dotted line marks a border of a group of drawn elements that form a logical entity with internal hierarchy, such as the modules constituting the multi-core processing fabric 110 in FIG. 1.
-   Lines or arrows crossing in the drawings are decoupled unless otherwise marked.
-   For clarity of the drawings, signals generally present in typical digital logic operation, such as clock signals, or the enable, address and data bit components of write or read access buses, are not shown in the drawings.

FIGS. 1-3 and the related descriptions below provide specifications for a multi-core data processing platform, according to embodiments of aspects of the invention, while FIGS. 4-7 and the associated descriptions provide specifications for the networking and memory resources that enable dynamically running any selected data processing task on any processing core of the platform, as well as support efficient communications among such tasks, according to embodiments of aspects of the invention. Finally, FIG. 8 and the related specifications describe embodiments of a billing subsystem for a multi-user parallel processing platform per the preceding FIGS., along with operating scenario examples for the billing rate adjustment based computing cost efficiency maximization techniques.

FIG. 1 provides a functional block diagram for an embodiment of the invented multi-core data processing system dynamically shared among data processing tasks of application software programs, with capabilities for application processing load adaptive allocation of the cores among the software applications configured for the system, as well as (as described in relation to FIGS. 4-7) efficient inter-core task-switching and inter-task communication resources, and (as described in relation to FIG. 8) the user's cost efficiency and the compute service provider's profit maximizing pricing adjustment and billing techniques.

Note that the terms software program, application program, application and program are used interchangeably in this specification, and each generally refers to any type of computer software able to run on data processing systems according to any embodiments of the invention. Also, references to a “set of” units of a given type, such as programs, logic modules or memory segments can, depending on the nature of a particular embodiment or operating scenario, refer to any positive number of such units.

For general context, the system per FIG. 1 comprises a processing core fabric 110 with an array 115 of cores 120 for processing instructions and data of a set of software application programs configured to run on the shared system 100. While in such a manner processing the application programs to produce processing results and outputs, the cores of the system access their input and output data arrays, which in embodiments of the invention comprise memories and input/output communication ports accessible directly or indirectly to one or more of the cores. Since the discussion herein is directed primarily to techniques for dynamically sharing the processing cores of the system among its application programs, as well as for efficiently running such programs on the cores of the system in parallel, rather than to implementation details of the cores themselves, aspects such as memories and communication ports of the cores or the system 100, though normally present within embodiments of the multi-core data processing system 100, are not shown in FIG. 1. Moreover, it shall be understood that in various embodiments, any of the cores 120 of a system 100 can comprise any types of software program processing hardware resources, e.g. central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs) or application specific processors (ASPs) etc., as well as time shares or other abstractions or virtualizations thereof. Embodiments of systems 100 can furthermore incorporate processing cores (CPUs etc.) that are not part of the dynamically allocated array 115 of cores, and such cores outside the array 115 can in certain embodiments be used to manage and configure e.g. system-wide aspects of the entire system 100, including the controller module 140 of the system and the array 115. For the operator to configure and monitor the system 100, embodiments of the invention provide a management interface 319 to pass information between the operator's management tools (e.g. a user interface and related software on a PC or terminal) and the data processing platform 100. Note that in various embodiments the actual software programs for configuring and monitoring processes for the system 100 can be run as an application 220 of a system 100 (either on the same or a different instance of system 100 than the given instance being managed), or such software programs can be run on a different platform, e.g. on a PC supporting the management user interface.

As illustrated in FIG. 1, the invention provides a data processing system 100 comprising an array 115 of processing cores 120, which are dynamically shared by a set of application programs configured to run on the system. In an embodiment of the invention, the individual application programs running on the system maintain, at specified addresses within the system 100 memories, their processing capacity demand indicators, signaling 130 to the controller 140 a level of demand of the system processing capacity by the individual applications. In a particular implementation, each of these indicators 130, referred to herein as core-demand-figures (CDFs), expresses how many cores 120 its associated application program is presently able to utilize for its data processing tasks. Moreover, in certain embodiments, the individual applications maintain their CDFs at specified registers within the system, e.g. in known addresses within the memory space of their root processes (i.e. task ID#0 of each application), with such application CDF registers being accessible by the hardware logic of the controller module 140. For instance, in an embodiment, the CDF 130 of a given application program is a function of the number of its schedulable tasks, such as processes, threads or functions (referred to collectively as tasks) that are ready to execute at a given time. In a particular embodiment of the invention, the CDF of an application program expresses on how many processing cores the program is presently able to execute in parallel. Moreover, in certain embodiments, these capacity demand indicators, for any given application, include a list 135 identifying its ready tasks in a priority order.
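
A software-level sketch of CDF maintenance per the above might look as follows; the class and field names are hypothetical, and the CDF register is modeled as a plain attribute rather than a memory-mapped address:

```python
# A hypothetical software-side sketch of maintaining the core-demand-figure
# (CDF) 130: the CDF equals the number of schedulable (ready) tasks, written
# to a register location readable by the controller's hardware logic.

class Application:
    def __init__(self, tasks):
        self.tasks = tasks           # task records with a 'ready' flag
        self.cdf_register = 0        # stands in for the known memory address

    def update_cdf(self):
        """Publish how many cores the program can presently utilize."""
        self.cdf_register = sum(1 for t in self.tasks if t["ready"])

app = Application([{"ready": True}, {"ready": True}, {"ready": False}])
app.update_cdf()
print(app.cdf_register)  # 2
```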

A hardware logic based controller module 140 within the system, through a repeating process, allocates and assigns the cores 120 of the system 100 among the set of applications and their tasks, at least in part based on the CDFs 130 of the applications. In certain embodiments, this application task to core assignment process 300 (see FIGS. 2 and 3) is exercised periodically, e.g. at even intervals such as once per a given number (for instance 64, or 1024, or so forth) of processing core clock or instruction cycles. In other embodiments, this process 300 can be run e.g. based on a change in the CDFs 130 of the applications 220. Also, in particular implementation scenarios, the conceptual module 140 includes application program specific sub-modules, which run task to core assignment algorithms within a given application program based on a change in the task priority listing 135 for the given application. While such conceptual application-specific sub-modules can impact which application tasks will be executing on the fabric 110, they will not by themselves change the numbers of cores allocated to any given application on the system. Accordingly, these application-internal task selection sub-processes can be run also in between successive runs of the complete core allocation and assignment process 300. The application task to core assignment algorithms of the controller 140 produce, for the cores of the fabric 115, identification of their respective tasks to process 460, as well as, for the application tasks on the system, identification of their execution cores 420 (if any, at a given time). Note that the verb “to assign” is used herein reciprocally, i.e., it can refer, depending on the perspective, both to the assignment of cores 120 to tasks 240 (see FIG. 2) as well as to the mapping of tasks 240 to cores 120. This is because, in the embodiments studied here in greater detail, the allocation and mapping algorithms of the controller 140 cause one task 240 to be assigned per any given core 120 of the array 115 by each run of such algorithms 300 (see FIGS. 2 and 3). As such, when it is written here, e.g., that a particular core #X is assigned to process a given task #Y, it could have also been said that task #Y is assigned for processing by core #X. Similarly, references such as “core #X assigned to process task #Y” could be written in the (more complex) form of “core #X for processing task #Y assigned to it”, and so forth.

Though not explicitly shown in FIG. 1, embodiments of the system 100 also involve timing and synchronization control information flows between the controller 140 and the core fabric 115, to signal events such as launching and completion of the process 300 (FIGS. 2-3) by the controller as well as to inform about the progress of the process 300, e.g. in terms of the advancing of its steps (FIG. 3). Also, in embodiments of the invention, the controller module is implemented by digital hardware logic within the system, and in particular embodiments, such controller modules exercise their repeating algorithms, including those of process 300 per FIGS. 2-3, without software involvement. Embodiments for the communications network and memory resources 400 of the multi-core fabric 110 are described in relation to FIGS. 4-7.

FIG. 2 illustrates the context for the process 300 performed by the controller logic 140 of the system 100, repeatedly mapping the to-be-executing tasks 240 of the set of application programs 210 to their target cores 120 within the array 115. In an embodiment, each individual application 220 configured for a system 100 provides a (potentially updating) collection 230 of its tasks 240, even though for clarity of illustration in FIG. 2 this set of application tasks is shown only for one of the applications within the set 210 of applications configured for a given instance of system 100. For the sake of illustration and description, the cores within the array 115 are herein identified with their core ID numbers from 0 through the number of cores within the array 115 less 1, the applications within the set 210 with their application ID numbers from 0 through the number of applications among that set 210 less 1, and the set of tasks 230 of any given application with their task ID numbers from 0 through the number of tasks supported per the given application less 1.

Note also that in certain embodiments, any application program instance 220 for a system 100 can be an operating system (OS) for a given user or users of the system 100, with such a user OS supporting a number of applications of its own; in such scenarios the OS client 220 on the system 100 can present its own applications to the controller 140 of the system as its tasks 240.

Moreover, in embodiments of the invention, among the applications 220 there can be supervisory or maintenance software programs for the system 100, used for instance to support configuring other applications 220 for the system 100, as well as to provide general functions such as system boot-up and diagnostics, and to facilitate access to the networking, I/O and system-wide memory etc. resources of the platform 100 also by the other application programs of the system.

In the general context per FIGS. 1 and 2, FIG. 3 provides a conceptual data flow diagram for an embodiment of the process 300, which maps each selected-to-execute application task 240 within the sets 230 to one of the cores 120 within the array 115.

More specifically, FIG. 3 presents, according to an aspect of the invention, the conceptual major phases of the task-to-core mapping process 300, used for maximizing the application program processing throughput of a data processing system hardware shared among a number of software programs. Such a process 300, repeatedly mapping the to-be-executing tasks of a set of applications to the array of processing cores within the system, involves a series of steps as follows (a software-level sketch of these steps follows the list):

-   (1) allocating 310 the array of cores among the set of programs on the system, at least in part based on the CDFs 130 of the programs, to produce for each program 220 a number of cores allocated to it 315 (for the time period in between the current and the next run of the process 300); and
-   (2) based at least in part on the allocating 310, for each given application that was allocated at least one core: (a) selecting 320, according to the task priority list 135, the highest priority tasks within the given application for execution, corresponding to the number of cores allocated to the given application, and (b) mapping 330 each selected task to one of the available cores of the array 115, to produce, i) for each core of the array, an identification 460 of the application and the task within the application that the given core was assigned to, as well as ii) for each application task selected for execution on the fabric 115, an identification 420 of its assigned core.

The repeatedly produced and updated outputs 420, 460 of the controller 140 process 300 will be used for repeatedly re-configuring connectivity through the fabric network 400, as described in the following with references to FIGS. 4-7.
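
The following Python sketch models the two phases listed above (allocation 310, then selection 320 and mapping 330); the entitlement-first surplus rule and all names are illustrative assumptions, as the actual embodiments implement these steps in hardware logic:

```python
# A software sketch (names hypothetical) of the two phases of process 300:
# (1) allocate cores among programs based on their CDFs and entitlements,
# (2) select each program's highest-priority ready tasks and map them to
# cores, one task per core.

def allocate_cores(total_cores, apps):
    """Phase 1: entitlement-first, then demand-driven surplus allocation."""
    alloc = {a: min(apps[a]["cdf"], apps[a]["ce"]) for a in apps}
    surplus = total_cores - sum(alloc.values())
    for a in apps:                      # grant leftover cores to unmet demand
        extra = min(surplus, apps[a]["cdf"] - alloc[a])
        alloc[a] += extra
        surplus -= extra
    return alloc

def map_tasks(alloc, apps):
    """Phase 2: pick top-priority tasks, assign one task per core."""
    core, core_to_task, task_to_core = 0, {}, {}
    for a, n in alloc.items():
        for task in apps[a]["priority_list"][:n]:
            core_to_task[core] = (a, task)      # output 460 per core
            task_to_core[(a, task)] = core      # output 420 per task
            core += 1
    return core_to_task, task_to_core

apps = {"A": {"cdf": 3, "ce": 2, "priority_list": [0, 1, 2]},
        "B": {"cdf": 1, "ce": 2, "priority_list": [0]}}
alloc = allocate_cores(4, apps)
print(alloc)                       # {'A': 3, 'B': 1}
print(map_tasks(alloc, apps)[0])   # core -> (app, task) assignments
```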

FIG. 3 also provides context for the user billing processes for the utility computing services provided with the platform 100. According to embodiments of the invention, these billing processes involve per-application billing counters 316, implementation and operating scenarios for which are discussed in relation to FIG. 8.

FIGS. 4-7 and the related specifications below describe embodiments of the on-chip network 400 of a system 100 and operating scenarios thereof, to achieve non-blocking transferring of memory images of tasks of software programs between cores of the fabric 110, as well as inter-task communication, through efficiently arranged access to the fabric memories. The inter-core and inter-task information exchange resources per FIGS. 4-7, in an embodiment of the invention, comprise hardware logic, and are capable of operating without software. The capabilities per FIGS. 4-7 provide logic, wiring, memory etc. system resource efficient support for executing any application task 240 at any core 120 within the system at any given time, as controlled, at least in part, by the controller 140 that regularly optimizes the allocation of the cores of the array 115 among the applications 220 on the system 100, as well as maps specific application tasks 240 to specific processing cores 120. The minimum overhead inter-task communications, also supported by the on-chip network 400, further enable resource efficiently achieving high performance for the application software programs 210 that dynamically share the multi-core based data processing platform 100.

FIG. 4 illustrates the task image transfer and inter-task communications network and memory resources 400 for an embodiment of the core fabric 110 (see FIG. 1 for further context of the conceptual module 400). Note that in FIGS. 3-8, for clarity of illustration of the functionality of the inter-core and inter-task communications facilities, certain signals that are primarily timing or control signals (as contrasted with data buses and such) are marked with gapped-line arrows. Examples of such control signals are the control information flows provided to direct the multiplexing of the read and write data buses, as well as the signals providing timing control.

Fabric Network for System Per FIG. 1: Transferring Memory Images of Tasks of Software Programs Executing on the System Between Cores and Backup Memories of the Multi-Core Processing Fabric:

Regarding the system functionality for switching the executing tasks for cores of fabric 110, FIG. 4 provides a conceptual diagram for a logic system 400 to back up and transfer the latest processing memory image (referred to herein also simply as an image) of any application program task 240 on the system 100 from and to any core 120 within the array 115, in accordance with an embodiment of the invention. As will be described later on (after the description of FIG. 6), the inter-core network and memory system 400 will, at least in certain embodiments, be used also for inter-task communication among the application program tasks running on the system 100. Note that in relation to FIGS. 4-7, in embodiments of the invention where the individual core specific memories within the array are not intended to contain the instructions and data for all the application tasks on the system, but rather for the specific task assigned to any individual core at a given time, the notion of a task processing image refers to the memory image used by the processing of the task. Various embodiments, implementing various designs between (and including) the extremes of, on one end, each core providing a dedicated memory segment for each application task on the system and, on the other end, each core providing a plain working memory holding the memory image of the application task assigned to it, will have their corresponding definitions of what information needs to be transferred between cores and interim memories (if any) to back up, retrieve or relocate a task. In the scenarios studied in detail in the following, in connection with FIGS. 4-7, it is assumed that each core of the array 115 holds in its memory the image of the application task assigned to it at a given time. Such a scenario significantly reduces the amount of memory needed by the individual cores as well as across the system 100, while it calls for a capability to transfer the task processing memory images between cores and back-up memories when having to resume processing of a task after a period of inactivity, possibly at a different core than its previous processing core. FIGS. 4-6 and the related descriptions below illustrate a logic system with such a memory image transfer capability.

At a digital logic design level, according to the herein studied embodiments per FIG. 4, the controller 140 identifies 420, to a cross-connect (XC) 430 between the core array 115 and the memory array 450, the appropriate source core from which to select the updated image 440 for each given application task specific segment 550 within the memory 450. In an alternative embodiment, each core 120 identifies 420 the application task ID# along with its updated processing image to the XC 430. In addition, at times of task switchover, under control from the controller 140, the appropriate updated new task processing images 440 are transferred from the memories 450, through another controller controlled 460 cross-connect (XC) 470, to each given core 120 of the array 115 subject to task switchover. Specifically, the controller 140 provides for the XC 470 identification 460 of the next application tasks 440 for the individual cores of the array 115, which causes the appropriate updated processing image to be transferred 480 from the memory array 450 to each given core of the system 100 subject to task switchover. Naturally, any given core for which the assigned application task ID# remains the same on successive core allocation periods (CAPs) can resume processing such a task uninterruptedly through such allocation period boundaries, without having to halt processing.

Note also that in case of certain embodiments, the XCs 430 and 470 are collectively referred to as a cross-connect between the array 115 of cores and the memories 450 of the fabric 110. Also, in certain scenarios, the concept of the on-chip network refers to the XCs 430 and 470 and the fabric and core memory access buses 410, 440, 480 they cross-connect, while in other scenarios, that concept includes also the fabric memories 450.

In a particular operating scenario, at the end of any given core to task allocation period, or after the set of tasks of any given application selected for execution changes (even within a CAP), for each such core within the system that got assigned a different next task to process (with such cores referred to as cores subject to task switchover), the updated processing image of its latest task is backed up 410 to a memory 450 that provides a dedicated memory segment 550 and related access logic (FIGS. 5-7) per each application task configured for the system 100. Specifically, in an embodiment, the controller 140, through logic with the core-specific multiplexers at XC 470, provides, at least conceptually as part of the bus 480, indications to the cores 120 regarding task switchovers, in response to which the system software at the cores subject to a switchover causes the existing task to be backed up 410 to its segment 550 at the memory array 450 and, following that, retrieves 480 the next task's image from its segment 550 at the memory array 450. Moreover, in a particular embodiment, after a core subject to task switchover has backed up 410 its outgoing task, the core will signal back to its multiplexer (element 620 in FIG. 6) at XC 470 to apply the provided new configuration 460, to cause the incoming application's image to be transferred 480 (under control of the core's system software) to the working memory of the core, and so that the incoming task assigned to execute on the core will be connected (in read mode) 480 to its segment 550 at the memories 450. Furthermore, according to such embodiments, the system software on a core subject to switchover also signals to the controller 140 about the completion of backing up its outgoing task, based on which the controller applies the updated configuration 420, i.e. the identification of the incoming task ID#, for the task-specific multiplexer 510 at XC 430, so that the incoming task assigned to execute on the core is connected (in write mode) 410 to the memory segments 550 of the tasks of its application 220, as well as so that the core of its execution will be connected in write mode to the correct memory segment 550 once that task is to be backed up 410 (see also FIG. 5 for further details). Note further that in certain embodiments of the invention, the cores 120 support two sides of their working memories, to allow the backing up 410 and retrieving 480 of the outgoing and incoming tasks to proceed concurrently, by copying 480 the incoming task's image to a different side of the working memory than what was used for the outgoing task's image, and by switching the active side of the working memory to the incoming task's side following the copying of its image from its segment 550 at the fabric memories 450. According to certain implementation practices, the cores also provide a memory space for their system software, which however, according to the herein discussed operating scenarios, is not activated during user application task processing times (i.e. between task switchovers).
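
The switchover sequence described above can be modeled at a software level roughly as follows; the two-sided working memory follows the description, but the data structures and names are invented for illustration:

```python
# An illustrative, software-level model (all names invented) of the task
# switchover sequence described above: back up the outgoing task's image
# to its segment 550, then retrieve the incoming task's image into the
# other side of a two-sided working memory, and flip the active side.

def task_switchover(core, fabric_memory, incoming_task):
    outgoing = core["active_task"]
    # 1) Back up the outgoing task's updated image (bus 410 via XC 430).
    fabric_memory[outgoing] = core["working_memory"][core["active_side"]]
    # 2) Retrieve the incoming task's image (bus 480 via XC 470) into the
    #    inactive side of the working memory, then switch the active side.
    spare = 1 - core["active_side"]
    core["working_memory"][spare] = fabric_memory[incoming_task]
    core["active_side"] = spare
    core["active_task"] = incoming_task

memory = {"A.0": b"image-A0", "B.0": b"image-B0"}
core = {"active_task": "A.0", "active_side": 0,
        "working_memory": [b"image-A0-updated", b""]}
task_switchover(core, memory, "B.0")
print(core["active_task"], memory["A.0"])  # B.0 b'image-A0-updated'
```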

According to the embodiments of the invention described herein in greater detail, based on the control 460 by the controller 140 for a given core indicating that it will be subject to a task switchover, the currently executing task is made to stop executing and its processing image is backed up 410, 520, 540 to the memory 450 (FIGS. 4 and 5), and following (as well as, in certain embodiments, during) that, the memory image of the next task assigned to execute on the given core is retrieved 610, 480 to the core from the memory 450 (FIGS. 4 and 6). During these application task switching proceedings, the operation of the cores subject to task switchover is controlled through the controller 140 and the system software configured for the cores, with said system software managing the backing up and retrieving of the outgoing and incoming task memory images from the memories 450, as well as stopping the execution of the outgoing task before backing it up and getting the incoming task's processing started once the local working memory of the core is configured with the incoming task's processing image. In these types of embodiments, cores not indicated by the controller 140 as being subject to task switchover are able to continue their processing uninterruptedly, even over the CAP transition times, without any idle time.

Note that, according to the embodiments of the invention described in the foregoing, the applying of updated task ID# configurations 460 for the core specific multiplexers 620 of XC 470 (see FIGS. 4 and 6), as well as the applying of updated processing core ID# configurations 420 for the application task specific multiplexers 510 at XC 430 (see FIGS. 4 and 5), can thus be safely and efficiently done on a one-multiplexer-at-a-time basis (reducing the system hardware and software implementation complexity and thus improving cost-efficiency), since tasks do not need to know whether, and at which core in the fabric 115, they or other tasks are executing at any given time. Instead of relying on knowledge of their respective previous, current (if any at any given time) or future execution cores by either the tasks or the system software of the cores, the invention enables flexibly running any task of any application at any core of the fabric, while providing inter-task communication more cost-efficiently through connecting the cores to their appropriate application task specific segments 550 at the fabric memories 450.

FIG. 5 shows, at a more detailed level, a portion of the logic system 400 (see FIGS. 1 and 4 for context) for backing up the updated task processing images from the cores of the system 100 to the task specific back-up memories 450, in accordance with an embodiment of the invention. As will be discussed later on, following the description per FIG. 6, the logic system per FIG. 5 is, in certain embodiments, used also for the tasks of any given application executing on the system 100 to write their inter-task communication info to each other.

In the task memory image backup mode of use of the logic per FIG. 5, according to the embodiment studied here in greater detail, each core 120 of the array 115 that is subject to task switchover transmits 410, through the XC 430, to its segment 550 in the memories 450, the updated processing image of its latest application task, when signaled to do so by the controller 140, in embodiments through its associated multiplexer 620 at XC 470. The XC 430 comprises, in a particular embodiment, a set of application task specific multiplexers 510, each of which selects the updated processing image instance from the set 410 corresponding to its task ID# for writing 540 to its associated task specific segment 550 at the memory array 450. Each such task specific multiplexer 510 makes these selections based on control 420 from the controller 140 that identifies the core that processed its associated application task before a task switchover. In case a given task was not being processed at a given time, in an embodiment the controller controls 420 the multiplexer 510 instance associated with such a task to not write anything to its associated segment 550 at the memory 450. In addition, the buses 410, 525 and 545 include a write enable indicator, along with write data and address (and any other relevant signals), from their source cores to the memory segments 550, to control (together with other system logic, e.g. per FIG. 7) write access from the cores 120 to the memory segments 550. The role of XC 530 will be described in reference to FIG. 7; for the task memory image backup mode, the XC 530 can be considered as being controlled 535 by the controller to simply pass-through connect the write access bus 520 of each application task finishing execution on a core of the array 115 to its segment 550 at the memories 450.

At the digital logic design level, a possible implementation scenario for the functionality per FIG. 5 is such that the signal bus instance within the set 410 carrying the updated processing images from the core ID #n (n is an integer between 0 and the number of cores in the array less 1) is connected to the data input #n of each multiplexer 510 of XC 430, so that the identification 420 of the appropriate source core ID# by the controller to a given multiplexer 510 causes XC 430 to connect the updated task processing image transmissions 410 from the core array 115 to their proper task specific segments 550 within the memory 450. In an embodiment, the controller 140 uses information from an application task ID# addressed look-up-table per the Table 4 format (shown later in this specification) in supplying the latest processing core identifications 420 to the application task specific multiplexers 510 of XC 430.
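
Behaviorally (not as RTL), the Table 4 driven selection at XC 430 can be sketched like this; the dictionary stands in for the task ID# addressed look-up-table, and all names are hypothetical:

```python
# A behavioral sketch of XC 430: a task-ID-addressed lookup table supplies
# each task specific multiplexer 510 with the ID# of the core that last
# processed its task, selecting which core's image bus 410 is written 540
# to that task's segment 550.

def xc430_select(latest_core_for_task, images_from_cores, task_id):
    """Return the updated image to write to task_id's memory segment,
    or None if the task was not being processed (nothing written)."""
    core_id = latest_core_for_task.get(task_id)     # control 420
    if core_id is None:
        return None                                 # no write to segment 550
    return images_from_cores[core_id]               # bus 410 -> 540

table4 = {"A.0": 2, "A.1": 0}                       # task -> latest core
print(xc430_select(table4, ["img0", "img1", "img2"], "A.0"))  # img2
```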

FIG. 6 shows, at a greater level of detail, in accordance with an embodiment of the invention, a portion of the logic system per FIG. 4 for retrieving the updated task processing images from the task specific back-up memories to their next processing cores within a system per FIG. 1. As will be discussed following this description of FIG. 6, the logic system per FIG. 6 is, in certain embodiments, used also for the tasks of an application executing on the system 100 to read their inter-task communication info from one another.

According to the embodiment studied here in greater detail, the XC 470 (see FIG. 4 for context) comprises core specific multiplexers 620, each of which, when operating in the task image transfer mode, selects the updated image (from the set 610) of the task identified 460 for processing by the core associated with a given multiplexer 620, to be transferred 480 to the working memory of that core 120.

Similar to the digital logic level description of the multiplexer 510 (in connection with FIG. 5), a possible implementation for the functionality illustrated in FIG. 6 is such that the read data bus instance (from the set 610) associated with application task ID #m (m is an integer between 0 and the number of application tasks supported by the system less 1) is connected to the data input #m of each multiplexer 620 instance, so that the identification (by the controller 140) of the active application task ID# 460 for each of these core specific multiplexers 620 of XC 470 causes the XC 470 to connect each given core 120 of the array 115 in read mode to the segment 550 at memory 450 associated with its indicated 460 active application task. In an embodiment, the controller 140 uses information from a core ID# addressed look-up-table per the Table 5 format (shown later in this specification) in supplying the next application task identifications 460 to the core specific multiplexers 620 of XC 470.
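
The read-direction counterpart at XC 470, driven by the Table 5 format core ID# addressed look-up-table, reduces behaviorally to a similar lookup; again a sketch with invented names:

```python
# A behavioral sketch of XC 470: a core-ID-addressed table supplies each
# core specific multiplexer 620 with its next task ID# 460, connecting the
# core in read mode 480 to that task's segment 550.

def xc470_select(next_task_for_core, segments, core_id):
    task_id = next_task_for_core[core_id]           # control 460
    return segments[task_id]                        # bus 610 -> 480

table5 = {0: "A.1", 1: "B.0"}                       # core -> next task
segments = {"A.1": "image-A1", "B.0": "image-B0"}
print(xc470_select(table5, segments, 0))            # image-A1
```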

Fabric Network for System Per FIG. 1: Inter-Task Communication Among Software Programs Executing on the Multi-Core Fabric of the System:

In addition to the capabilities to activate, deactivate and relocate tasks 240 among the cores 120 of a system 100 through the task image transfers as outlined above in connection with FIGS. 4-6, the system 100 enables the tasks 240 of the application programs 220 on the system to communicate with each other, e.g. to call and return to each other, passing input and output data (incl. pointers, for instance, to general memory and I/O facilities of the system 100), between cores of the fabric 110. Such inter-task communication within an application program executing at system 100, in an embodiment of the invention, is handled by using the logic, wiring and memory resources 400 per FIGS. 4-7 during the task processing times (i.e. when these XC and related resources are not being used for task image transfers).

According to the herein described embodiments, where XC 430 has dedicated multiplexers 510 and 720 for each application task configured to run on the multi-core processing fabric 110, in order to provide a write access from any core of the array 115 to any task specific segment 550 at the fabric memory 450, any number of, up to all, tasks executing on the multi-core fabric are able to concurrently write their inter-task communication information to the memory segments of other tasks, in a particular implementation at least within the scope 230 of their own application, as well as to their own segment. Similarly, embodiments of the invention where XC 470 has a dedicated multiplexer 620 for each core of the fabric, in order to provide any core of the array 115 with a read access to any task specific segment 550 at the memories 450, enable any number of, up to all, tasks executing on the array 115 to concurrently read their inter-task communication information from the memories 450, in a particular implementation, specifically, from their own segments 550 at the memories 450. Moreover, such embodiments further support any mix or match of concurrent writes and reads per the above. Such non-blocking inter-task communications connectivity through the fabric network 400 facilitates high data processing throughput performance for the application programs 220 configured to run on the system 100.

Specifically, in a particular embodiment of the invention, the inter-task communication using the XCs 430, 470 and the attached wiring shown in FIGS. 4-7 is supported among the set of tasks 230 of any given individual application program 220. Additionally, inter-application communication is supported in embodiments of system 100 through further networking, I/O and memory access means, including software based client/server and/or peer-to-peer communications techniques and networking and I/O ports, as well as the general memories of the cores 120 and the system 100. In a specific embodiment, the application-scope 230 inter-task communication is facilitated through providing the tasks 230 of any given application, while executing on the core array 115, with a write access 410 to each other's segments 550 (including their own) in the memory 450, and a read access 480 to their own segments 550.

Following the image transfers of a task switchover, the new task executing on any given core has a connection through XC 470 to its memory segment 550, so that data specific to the new task can be read from the memory 450 to its assigned execution core. In an embodiment, each task periodically polls its memory segment 550 for any new information written for it by other tasks, and accordingly reads any such new information, where applicable transferring such information, or further information pointed to by said new information written by other tasks (e.g. from a general memory of the system 100), to the local working memory at its processing core. In alternative embodiments, logic associated with the memory segments 550 generates interrupt-type notifications to the core at that time associated with any given memory segment 550 following a write operation to such a segment, for the task 240 executing on such a core 120 to know that it has new inter-task communications to read at its memory segment 550. The receiving task's controllable reading of data from its memory segment 550 is accomplished, in a particular embodiment, together with the data access resources and procedures as discussed, by providing an address line driven by the receiving core to its memory segment 550; in such an embodiment, the cores provide the addresses (of task specific segment 550 scope within the memory 450) for the data entries to be loaded on the buses 610, 480 connected to the given core. While the connection from the buses 610 to the buses 480, connecting each executing task's memory segment 550 to its processing core, is made through XC 470, the addresses for the executing tasks to read their memory segments 550 are connected from the processing cores of the tasks to their memory segments 550 (at least conceptually) through XC 430, which, using the same control 420, connects also the write access data buses from the cores to the memories 450. In particular logic implementations where separate read and write addresses are used per each given task executing at any of the cores of the array, the read address is configured to pass through the XC 530 (and the logic per FIG. 7), i.e. it gets connected directly from the task specific multiplexer 510 to the memory segment 550 associated with the given task, while the write address gets further cross-connected through the XC 530.
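
A hypothetical polling loop for the receive side described above might look as follows; the segment is modeled as a simple list, and the interrupt-driven variant would notify the core instead of polling:

```python
# A hypothetical sketch of a task polling its own segment 550 for entries
# written by other tasks, then copying them to local working memory.

def poll_segment(segment, last_seen):
    """Return new entries appended to this task's segment since last poll,
    plus the updated read cursor."""
    fresh = segment[last_seen:]
    return fresh, len(segment)

segment = []                                       # this task's segment 550
segment.append(("from_task", "A.0", b"payload"))   # write by another task
new, cursor = poll_segment(segment, 0)
print(new)   # [('from_task', 'A.0', b'payload')]
```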

In addition to the read access by any task to its own memory segment 550 (as described above), by providing write access for the tasks of a given application 230 to each other's (incl. their own) memory segments 550 at the fabric memory 450, the tasks of any given application on the system can communicate with each other in each direction. In an embodiment of the invention, such write access is provided, in part, by having the control information 420, i.e., the ID# of the core assigned to any given application task, from controller 140 be applied to the XC 430 right after the completion of each run of the placement process 300 (incl. completion of task image backups), so that the updated information 420 is usable by the XC already during the task processing time of the CAPs, rather than only at its end (when it is used to direct the task image back-ups). This causes that, while the tasks of any given application are processed at whatever set of cores within the array 115, their associated write-access connections 540 to memories 450 point to their current application task segments 550 at the memories 450. Moreover, when the task 240 ID#s of any given application 220, per the Table 4 format used for the info 420, comprise the same common (at least conceptually, most significant bits based) prefix, and when accordingly the task memory segments 550 of any given application 220 are within a contiguous memory range within the memory array 450, the set 525 (FIG. 5) of write access buses 540 of the tasks 230 of the same application collectively point to the collective memory range of that application within the memory 450. As such, by providing a further XC 530 between said set of write access buses 525 of a given application and the eventual write access buses 545 to the task segments 550 of the given application at memory 450, and by having the application tasks, from their processing cores, provide the control to XC 530 along with their write access bus signals through their task specific multiplexers 510, write access by any task of an application to the memory segments 550 of all tasks of the same application is accomplished. Note that, according to the embodiments described herein at detail level, there is one XC 530 per each application 220 supported by the system 100.

At the task memory image transfer time for cores subject to task switchover, the XCs 530 are controlled to pass through the image transfer from any core to the memory segment 550 dedicated to the task to which the given core was assigned prior to the switchover. In an embodiment, this image transfer time control 535 for XCs 530 is provided by the controller 140. Alternatively, it can be provided by the application tasks, using the same mechanisms as during the task processing time, i.e., during the time periods outside the task image transfer times for any given core (described in the following).

During such task processing times, and while a task at a given core has an active write request or operation ongoing, the bus 410 from each core through the multiplexers 510 to the XC 530 identifies, among other relevant write access signals, at least during times of an active write request or operation, the destination task of its write; this identification of the same-application-scope task ID# can be provided e.g. as a specified bit range 735 (FIG. 7) within the (write) address bits of buses 410 and 525. In an embodiment, as illustrated in FIG. 7, each application 220 specific XC 530 comprises a set of task 240 specific multiplexers 720 that are controllable through bus 520 instance specific comparators 740, each of which identifies 750 whether its associated task specific bus 520 instance is requesting a write access to the memory segment 550 dedicated to the task that a given multiplexer 720 instance is specific to. Each comparator 740 instance sets its output 750 to an active state, e.g. logic high, if its input instance among the set 735 matches the ID# of the task 745 that the given set of comparators 740 is associated with (which is the same task that the multiplexer 720, and the arbitrator 760 to which the outputs 750 from the given set 741 of comparators connect, are associated with). Each task specific set 741 of comparators 740 has its unique task ID# input 745; in an embodiment, there is one write-source task specific comparator within each write-destination task specific set 741 for each task of the application program that the multiplexer 720 serves. Within the context of FIG. 7, the sufficient scope of the write-destination task ID# 745 is that of intra-application; here the write-destination task ID# 745 is to identify one task 240 among the set of tasks 230 of the given application program 220 that the logic and memory resources per FIG. 7 are specific to. I.e., per any given set of comparators 741 associated with a particular write destination specific multiplexer instance 720, one common task ID# 745, identifying that particular write destination task within its application, is sufficient. Note also that the set of buses 525 of a given application 220 will reach the multiplexer 720 instance of each task 240 of the given application, even though, for the sake of clarity of illustration, only one such task-specific multiplexer 720 of an XC 530 of the given application is shown in FIG. 7.

Among the writing-source task specific bus 520 instances identified by their comparators 740 (e.g. by a high logic state on the signal 750 driven by a given source task specific comparator instance) as requesting a write to the memory segment 550 of the task to which the given multiplexer 720 is dedicated, an arbitrator logic module 760 will select 770 one bus 520 instance at a time for carrying out its write 540. The arbitrator 760 asserts a write accepted signal to the execution source core of the task so selected to carry out its write, while any other cores, in an embodiment among those requesting a write simultaneously, will get a write request declined signal from the arbitrator 760. While not shown in FIG. 7, for clarity of illustration of the main functionality involved, the write accepted/declined signals for any given tasks executing at the cores of the array 115, according to an embodiment of the invention, are connected from the arbitrators 760 associated with the tasks 230 of their application program through the XC 470, along with the buses 610, 480, to their assigned cores; the write request accepted/declined indications from all tasks 230 of a given application become part of the bus 610 instance for any task (FIG. 6), and thus any given task executing on any core will continuously get the write accepted/declined indications from all other tasks of its local application through its receive bus 480 from the module 400.

In an embodiment, the arbitrator 760 will choose the core accepted for write 540, in case of multiple simultaneously requesting cores, by using a linearly revolving selection algorithm (incrementing the selected task ID# by one and returning back to 0 from the highest task ID#, while skipping any tasks not requesting a write); in case of a single requesting core, the arbitrator simply accepts any such singular write request directly. Moreover, in order to prevent any single source task, through otherwise potentially long lasting writes 540 to a given destination task memory segment 550, from blocking other tasks from their fair time share of write 540 access to the given destination task's memory, certain embodiments of module 760 will run their source task selection algorithm periodically (e.g. every 64 or 1024 clock cycles or such) and, in the presence of multiple tasks with an active write request, choose a revolving new task (among the tasks requesting a write) to be accepted for write access on successive runs of the writing task selection algorithm.
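For illustration, the following is a minimal software sketch of the revolving (round-robin) source task selection behavior described above; in the described embodiments this is implemented as hardware logic within the arbitrator module 760, and the class and method names here are illustrative assumptions only.

```python
# Minimal software sketch of the revolving (round-robin) write arbitration
# described above; in the described embodiments this is hardware logic in
# module 760. Names (RevolvingArbiter, select) are illustrative.

class RevolvingArbiter:
    def __init__(self, num_tasks: int):
        self.num_tasks = num_tasks
        self.last_granted = num_tasks - 1  # so task 0 is favored first

    def select(self, write_requests: list[bool]) -> int | None:
        """Grant one requesting task ID#, rotating from the last grant;
        returns None if no task is requesting a write."""
        for offset in range(1, self.num_tasks + 1):
            candidate = (self.last_granted + offset) % self.num_tasks
            if write_requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None

arb = RevolvingArbiter(num_tasks=4)
print(arb.select([False, True, False, True]))  # grants task 1
print(arb.select([False, True, False, True]))  # revolves to task 3
```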

In various embodiments of the invention, the software of the application tasks 240 supports a protocol for exchanging information between themselves through the task specific segments 550 at the fabric memory array 450, so that multiple tasks are able to write successively to the memory segment 550 of a given task without overwriting each other's info, and so that the receiving task is able to keep track of any unread information written by any other task to its memory segment 550. According to one such embodiment, each task specific memory segment 550 provides a reserved inter-task communications write and read memory space, referred to as a spool area, along with a write control register, or a set of such registers, at specified address(es) for the writing and reading tasks to keep track of where to write and read new information within the spool area. In certain scenarios, the spool area is divided into writing task specific sub-segments. In such scenarios, each writing task, being configured with (e.g. through its task ID# within its application program) the location of its sub-segment within the spool area, can itself keep track of the address to which it is to write its next block of information in a given receiving task's spool area, without needing a read access to any receiving task's memory segment 550. In addition, the writing tasks, after completing a write to a receiving task's spool area, in the herein discussed embodiments, update their related write control register at the receiving task's memory segment 550, to inform the receiving task of the new write operation (e.g. the address up to which there is new information to be read). When each writing task uses its spool area at the receiving task's memory segment 550 as a circular buffer, with the buffer write address counter returning to zero after reaching the maximum length configured for its spool sub-segment, one way of preventing any given writing task from overwriting any unread information at its spool sub-segment is that each receiving task repeatedly writes for its writing tasks (using the above described inter-task communication mechanism) the maximum address up to which any given writing task is presently allowed to write at the receiving task's spool, according to the address up to which the receiving task has read the spool sub-segment in question. Through this method the writing task is also able to keep track of how much of its written information the receiving task has confirmedly read at any given time. As discussed above, in certain embodiments, the tasks repeatedly read the write control registers of their spool areas, to know whether and where they have newly written information from other tasks to read. In alternative embodiments, changes to the write control registers cause read request notifications (e.g. through a processor interrupt mechanism) from the memory segments 550 to their associated cores 120 of the array 115.
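The following is a minimal software sketch of the spool area protocol described above, assuming one writer specific circular sub-segment with a "written up to" control register updated by the writer and an "allowed up to" limit fed back by the receiving task; all names (SpoolSubSegment, write_block, read_new) are illustrative, not from the source.

```python
# Minimal sketch (not the patent's logic implementation) of the spool-area
# protocol: a per-writer circular sub-segment plus write/read control
# registers that bound how far a writer may advance.

class SpoolSubSegment:
    def __init__(self, length: int):
        self.buf = [None] * length     # writer-specific circular sub-segment
        self.length = length
        self.write_ptr = 0             # kept by the writing task
        self.read_ptr = 0              # kept by the receiving task
        self.write_ctrl = 0            # "written up to" register at receiver
        self.limit_ctrl = length - 1   # "allowed up to" fed back by receiver

    def free_slots(self) -> int:
        return (self.limit_ctrl - self.write_ptr) % self.length

    def write_block(self, words: list) -> bool:
        """Writer appends a block if the receiver-granted limit allows it,
        then updates the write control register to announce new data."""
        if len(words) > self.free_slots():
            return False               # would overwrite unread data
        for w in words:
            self.buf[self.write_ptr] = w
            self.write_ptr = (self.write_ptr + 1) % self.length
        self.write_ctrl = self.write_ptr
        return True

    def read_new(self) -> list:
        """Receiver drains up to the announced write pointer, then widens
        the writer's allowed range accordingly."""
        out = []
        while self.read_ptr != self.write_ctrl:
            out.append(self.buf[self.read_ptr])
            self.read_ptr = (self.read_ptr + 1) % self.length
        self.limit_ctrl = (self.read_ptr - 1) % self.length
        return out

spool = SpoolSubSegment(length=8)
spool.write_block(["msg0", "msg1"])
print(spool.read_new())  # ['msg0', 'msg1']
```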

Regarding the descriptions of the drawings herein, note that in various embodiments, the modules and steps of the on-chip network 400, as well as the controller 140 and the process 300 providing control for the fabric network 400, can be implemented using various combinations of software and hardware logic; for instance, various memory management techniques can be used to pass (series of) pointers to the actual memories where the data elements of concern are available, rather than passing the actual data directly, etc.

Module-Level Implementation Specifications for the Application Task to Core Placement Process:

While module level logic specifications were provided in the foregoing for embodiments of the on-chip network 400, such details for embodiments of the steps of the process 300 (FIG. 3) are described in the following. In an embodiment of the invention, the process 300 is implemented by hardware logic in the controller module 140 of a system per FIG. 1.

In the herein studied operating scenarios, objectives for the core allocation algorithm 310 include maximizing the system core utilization (i.e., minimizing core idling so long as there are ready tasks), while ensuring that each application gets at least up to its entitled (e.g. contract based minimum) share of the system core capacity whenever it has processing load to utilize such an amount of cores. In the embodiment considered herein regarding the system capacity allocation optimization methods, all cores 120 of the array 115 are allocated on each run of the related algorithms 300. Moreover, let us assume that each application configured for the given multi-core system 100 has been specified its entitled quota 317 of the cores, at least up to which quantity of cores it is to be allocated whenever it is able to execute on such a number of cores in parallel; typically, the sum of the applications' entitled quotas 317 is not to exceed the total number of cores in the system. More precisely, according to the herein studied embodiment of the allocation algorithm 310, each application program on the system gets from each run of the algorithm:

-   (1) at least the lesser of its (a) entitled quota 317 and (b) Core Demand Figure (CDF) 130 worth of the cores (and in case (a) and (b) are equal, the 'lesser' shall mean either of them, e.g. (a)); plus
-   (2) as much beyond that to match its CDF as is possible without violating condition (1) for any application on the system; plus
-   (3) the application's even division share of any cores remaining unallocated after conditions (1) and (2) are satisfied for all applications 210 sharing the system 100.

In an embodiment of the invention, the cores 120 to application programs 220 allocation algorithm 310 is implemented per the following specifications (a software sketch of these steps follows the list):

-   -   (i) First, any CDFs 130 by all application programs, up to their entitled share 317 of the cores within the array 115, are met. E.g., if a given program #P had its CDF worth zero cores and an entitlement for four cores, it will be allocated zero cores by this step (i). As another example, if a given program #Q had its CDF worth five cores and an entitlement for one core, it will be allocated one core by this stage of the algorithm 310.
    -   (ii) Following step (i), any processing cores remaining unallocated are allocated, one core per program at a time, among the application programs whose demand 130 for processing cores had not been met by the amounts of cores so far allocated to them by preceding iterations of this step (ii) within the given run of the algorithm 310. For instance, if after step (i) there remained eight unallocated cores and the sum of unmet portions of the program CDFs was six cores, the program #Q, based on the results of step (i) per above, will be allocated four more cores by this step (ii) to match its CDF.
    -   (iii) Following step (ii), any processing cores still remaining unallocated are allocated among the application programs evenly, one core per program at a time, until all the cores of the array 115 are allocated among the set of programs 210. Continuing the example case from steps (i) and (ii) above, this step (iii) will be allocating the remaining two cores to certain two of the programs. In particular embodiments, the programs with zero existing allocated cores, e.g. program #P from step (i), are prioritized in allocating the remaining cores at the step (iii) stage of the algorithm 310.
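The following is a minimal software sketch of allocation steps (i) through (iii) above; per the studied embodiment this logic resides in hardware within the controller 140, and the function and variable names here are illustrative assumptions.

```python
# Minimal software sketch of allocation steps (i)-(iii); the studied
# embodiment implements this in hardware logic in controller 140.

def allocate_cores(cdfs: list[int], entitlements: list[int],
                   total_cores: int) -> list[int]:
    n = len(cdfs)
    alloc = [0] * n
    # Step (i): meet each CDF up to the program's entitled share.
    for p in range(n):
        alloc[p] = min(cdfs[p], entitlements[p])
    remaining = total_cores - sum(alloc)
    # Step (ii): one core per program at a time toward still-unmet CDFs.
    while remaining > 0 and any(alloc[p] < cdfs[p] for p in range(n)):
        for p in range(n):
            if remaining > 0 and alloc[p] < cdfs[p]:
                alloc[p] += 1
                remaining -= 1
    # Step (iii): spread any leftover cores evenly, favoring programs
    # that so far hold zero cores (per the particular embodiments noted).
    order = sorted(range(n), key=lambda p: alloc[p])
    while remaining > 0:
        for p in order:
            if remaining > 0:
                alloc[p] += 1
                remaining -= 1
    return alloc

# Programs #P (CDF 0, entitlement 4) and #Q (CDF 5, entitlement 1) from the
# examples above, on a hypothetical eight-core array:
print(allocate_cores(cdfs=[0, 5], entitlements=[4, 1], total_cores=8))
# -> [2, 6]: #Q gets its CDF met (steps i-ii); the surplus spreads evenly.
```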

Moreover, in certain embodiments, the iterations of steps (ii) and (iii) per above are started from a revolving application program within the set 210, e.g. so that the application ID# to be served first by these iterations is incremented by one (returning to ID #0 after reaching the highest application ID#) for each successive run of the process 300 and the algorithm 310 as part of it. Moreover, embodiments of the invention include a feature by which the algorithm 310 allocates for each application program, regardless of the CDFs, at least one core once in a specified number (e.g. sixteen) of process 300 runs, to ensure that each application will be able to keep at least its CDF 130 input to the process 300 updated.

According to the descriptions and examples above, the allocating of the array of cores 115, according to the embodiments of the algorithm 310 studied herein in detail, is done in order to minimize the greatest amount of unmet demand for cores (i.e. the greatest difference between the CDF and the allocated number of cores for any given application 220) among the set of programs 210, while ensuring that any given program gets at least its entitled share of the processing cores following such runs of the algorithm for which it demanded 130 at least such entitled share 317 of the cores.

Once the set of cores 115 is allocated 310 among the set of applications 210, specific core 120 instances are assigned to each application 220 that was allocated one or more cores on the given core allocation algorithm run 310. In an embodiment, one schedulable 240 task is assigned per one core 120. Objectives for the application task to core placement algorithm 330 include minimizing the total volume of tasks to be moved between cores (for instance, this means that tasks continuing their execution over successive CAPs will stay on their existing cores). In certain embodiments of the invention, the system controller 140 assigns the set of cores (which set can at times be zero for any given application) for each application, and further processes for each application determine how any given application utilizes the set of cores allocated to it. In other embodiments, such as those studied herein in further detail, the system controller 140 also assigns a specific application task to each core.

To study details of an embodiment of the process 300, let us consider the cores of the system to be identified as core #0 through core #(N−1), wherein N is the total number of pooled cores in a given system 100. For simplicity and clarity of the description, we will from hereon consider an example system under study with a relatively small number N of sixteen cores. We further assume here a scenario of a relatively small number of also sixteen application programs configured to run on that system, with these applications identified for the purpose of the description herein alphabetically, as application #A through application #P. Note however that the invention presents no actual limits for the number of cores, applications or tasks for a given system 100. For example, instances of system 100 can be configured with a number of applications that is lesser or greater than (as well as equal to) the number of cores.

Following the allocation 310 of the cores among the applications, for each active application on the system (i.e., those that were allocated one or more cores by the latest run of the core allocation algorithm 310), the individual ready-to-execute tasks 240 are selected 320 and mapped 330 to the number of cores allocated to the given application.

The task selection 320 step of the process 300 produces, for each given application of the set 210, lists 325 of to-be-executing tasks to be mapped 330 to the subset of cores of the array 115. Note that, at least in some embodiments, the selection 320 of the to-be-executing tasks for any given active application (such that was allocated 310 at least one core) is done not only following a change in the allocation 310 of cores among the applications, but also following a change in the task priority list 135 of the given application, including when not in connection to a reallocation 310 of cores among the applications. At least in such embodiments, the active task to core mapping 330 is done logically individually for each application, however keeping track of which cores are available for any given application, e.g. by running the mapping algorithm for one application at a time, or by first assigning for each application their respective subsets of cores among the array 115 and then running the mapping 330 in parallel for each application with new tasks to be assigned to their execution cores.

In the embodiments discussed herein in greater detail, the task to core mapping algorithm 330 for any application begins by keeping any continuing tasks, i.e., tasks selected to run on the array 115 both before and after the present task switchovers, mapped to their current cores also on the next allocation period. After that rule is met, any newly selected tasks for the application are mapped to available cores. Specifically, assuming that a given application was allocated P (a positive integer) cores beyond those used by its continuing tasks, the P highest priority ready but not-yet-mapped tasks of the application are mapped to the P next available (i.e. not-yet-assigned) cores within the array 115 allocated to the application. In case any given application had fewer than P ready tasks, the highest priority other (e.g. waiting, not ready) tasks are mapped to the remaining available cores among the number (P) of cores allocated to the given application; these other tasks can thus directly begin executing on their assigned cores once they become ready. Note further that, in an embodiment, the placing of newly selected tasks, i.e. the selected tasks of applications beyond the tasks continuing over the switchover transition, is done by mapping such yet-to-be-mapped application tasks, in incrementing application task ID# order, to available cores in incrementing core ID# order. A software sketch of this mapping rule follows.
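The following is a minimal software sketch of the mapping rule just described: continuing tasks keep their cores, and newly selected tasks are placed, in incrementing task ID# order, onto available cores in incrementing core ID# order. Names are illustrative assumptions, not from the source.

```python
# Minimal sketch of the task-to-core mapping algorithm 330 behavior
# described above; names are illustrative.

def map_tasks_to_cores(selected_tasks: list[int],
                       prev_mapping: dict[int, int],
                       allocated_cores: list[int]) -> dict[int, int]:
    """selected_tasks: task ID#s chosen for the next allocation period;
    prev_mapping: task ID# -> core ID# from the previous period;
    allocated_cores: core ID#s allocated to this application."""
    mapping = {}
    # Rule 1: continuing tasks stay on their current cores.
    for task in selected_tasks:
        if task in prev_mapping and prev_mapping[task] in allocated_cores:
            mapping[task] = prev_mapping[task]
    # Rule 2: newly activating tasks fill the remaining cores, in
    # incrementing task ID# order onto incrementing core ID# order.
    free_cores = sorted(set(allocated_cores) - set(mapping.values()))
    new_tasks = sorted(t for t in selected_tasks if t not in mapping)
    mapping.update(zip(new_tasks, free_cores))
    return mapping

# Task 5 continues on core 2; tasks 1 and 7 are newly placed on cores 0, 4.
print(map_tasks_to_cores([5, 1, 7], {5: 2, 3: 0}, [0, 2, 4]))
# -> {5: 2, 1: 0, 7: 4}
```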

Summary of Process Flow and Information Formats Produced and Consumed by Main Stages of the Application Task to Core Mapping Process:

The production of the updated mappings 460, 420 between the selected application tasks 240 and the processing cores 120 of the system 100 by the process 300 (FIG. 3, implemented by controller 140 in FIG. 1), from the Core Demand Figures (CDFs) 130 and task priority lists 135 of the applications 220 (FIG. 2), as detailed above with module level implementation examples, proceeds through the following stages and intermediate results (in reference to FIG. 3), according to an embodiment of the invention:

Each application 220 produces its CDF 130, e.g. an integer between 0 and the number of cores within the array 115, expressing how many concurrently executable tasks 240 the application presently has ready to execute. A possible implementation for the information format 130 is such that logic within the core allocation module 310 periodically samples the CDF bits from the segment 550 at memory 450 dedicated to the (root process) task #0 of each application and, based on such samples, forms an application ID-indexed table (per Table 1 below) as a ‘snapshot’ of the application CDFs to launch the process 300. An example of the format of the information 130 is provided in Table 1 below; note however that in the hardware logic implementation, the application ID index, e.g. for range A through P, is represented by a digital number, e.g., in range 0 through 15, and as such, the application ID# serves as the index for the CDF entries of this array, eliminating the need to actually store any representation of the application ID for the table providing the information 130:

TABLE 1

| Application ID index | CDF value |
|----------------------|-----------|
| A                    | 0         |
| B                    | 12        |
| C                    | 3         |
| ...                  | ...       |
| P                    | 1         |

Regarding Table 1 above, note that the values of the entries shown are simply examples of possible values of some of the application CDFs, and that the CDF values of the applications can change arbitrarily for each new run of the process 300 and its algorithm 310 using the snapshot of CDFs.

Based at least in part on the application ID# indexed CDF array 130 per Table 1 above, the core allocation algorithm 310 of the process 300 produces another similarly formatted application ID indexed table, whose entries 315 at this stage are the numbers of cores allocated to each application on the system, as shown in Table 2 below:

TABLE 2

| Application ID index | Number of cores allocated |
|----------------------|---------------------------|
| A                    | 0                         |
| B                    | 6                         |
| C                    | 3                         |
| ...                  | ...                       |
| P                    | 1                         |

Regarding Table 2 above, note again that the values of the entries shown are simply examples of possible numbers of cores allocated to some of the applications after a given run of the algorithm 310, and that in hardware logic this array 315 can be simply the numbers of cores allocated per application, as the application ID# for any given entry of this array is given by the index # of the given entry in the array 315.

The application task selection sub-process 320, done in embodiments of the process 300 individually, e.g. in parallel, for each application of the set 210, uses as its inputs the per-application core allocations 315 per Table 2 above, as well as the priority ordered lists 135 of the ready task IDs of any given application. Each such application specific list 135 has the (descending) task priority level as its index, and the intra-application scope task ID# as the value stored at each such indexed element, as shown in Table 3 below; the notes regarding implicit indexing and the non-specific example values per Tables 1-2 apply also for Table 3:

TABLE 3

| Task priority index # -- application internal (lower index value signifies more urgent task) | Task ID # (points to start address of the task-specific sub-range 550 within the per-application dedicated address range at memory 450) |
|----|----|
| 0  | 0  |
| 1  | 8  |
| 2  | 5  |
| ...| ...|
| 15 | 2  |
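For illustration, the following is a minimal software sketch of the selection 320 from such a priority list, assuming the selected tasks are simply the highest priority ready tasks up to the application's core allocation 315; the function name is an illustrative assumption.

```python
# Minimal sketch of the task selection sub-process 320: given an
# application's core allocation (per Table 2) and its priority-ordered
# ready task list (per Table 3), pick the to-be-executing tasks 325.

def select_tasks(priority_list: list[int], allocated_cores: int) -> list[int]:
    """priority_list: task ID#s ordered by descending priority (index 0 is
    the most urgent); returns the task ID#s selected for execution."""
    return priority_list[:allocated_cores]

# With the Table 3 ordering and an allocation of 3 cores, tasks 0, 8 and 5
# (the three most urgent) are selected:
print(select_tasks([0, 8, 5, 2], allocated_cores=3))  # [0, 8, 5]
```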

In an embodiment, each application 220 of the set 210 maintains its task priority list 135 per Table 3 at a specified address within its task #0 segment 550 at memory 450, from where logic at controller 140 retrieves this information to be used as an input for the active task selection sub-process 320, which produces the per-application listings 325 of selected tasks. Based at least in part on the application specific active task listings 325, the core to application task assignment algorithm module 330 produces an array 420, indexed with the application and task IDs, that provides as its contents the processing core ID (if any) assigned to each such task, per Table 4 below:

TABLE 4

| Application ID -- MSBs of index | Task ID (within the application of column to the left) -- LSBs of index | Processing core ID (value 'N' indicates that the given task is not presently selected for execution at any of the cores) |
|-----|-----|-----|
| A   | 0   | 0   |
| A   | 1   | N   |
| ... | ... | ... |
| A   | 15  | 3   |
| B   | 0   | 1   |
| B   | 1   | N   |
| ... | ... | ... |
| B   | 15  | 7   |
| C   | 0   | 2   |
| ... | ... | ... |
| P   | 0   | 15  |
| ... | ... | ... |
| P   | 15  | N   |

Finally, by inverting the roles of index and contents from Table 4, an array 460 expressing to which application task ID# each given core of the fabric 110 got assigned, per Table 5 below, is formed. Specifically, Table 5 is formed by using as its index the contents of Table 4, i.e. the core ID numbers (other than those marked 'N'), and as its contents the application task ID index from Table 4 corresponding to each core ID#:

TABLE 5

| Core ID index | Application ID | Task ID (within the application of column to the left) |
|-----|-----|-----|
| 0   | P   | 0   |
| 1   | B   | 0   |
| 2   | B   | 8   |
| ... | ... | ... |
| 15  | N   | 1   |

Regarding Tables 4 and 5 above, note that the symbolic application IDs (A through P) used here for clarity will, in a digital logic implementation, map into numeric representations, e.g. in the range from 0 through 15. Also, the notes per Tables 1-3 above regarding the implicit indexing (i.e., the core IDs for any given application ID entry are given by the index of the given entry, eliminating the need to store the core IDs in this array) apply for the logic implementation of Tables 4 and 5 as well.

In a hardware logic implementation, the application ID and the intra-application task ID of Table 5 can be bitfields of the same digital entry at any given index of the array 460; the application ID bits can be the most significant bits (MSBs) and the task ID bits the least significant bits (LSBs), and together these, in at least one embodiment, form the start address of the active application task's memory address range in the memory array 450 (for the core with ID# equaling the given index of the core to application task ID# array per Table 5).
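The following is a minimal software sketch of the Table 4 to Table 5 inversion and of the bitfield packing just described; the field widths and the per-task segment size here are illustrative assumptions, not values from the source.

```python
# Minimal sketch of inverting Table 4 into Table 5 and packing the
# application ID (MSBs) and task ID (LSBs) of a Table 5 entry into a
# segment start address. Field widths are illustrative assumptions.

TASK_ID_BITS = 4          # 16 tasks per application in the running example
SEGMENT_WORDS = 1 << 10   # assumed size of one task segment 550

def invert_task_to_core(table4: dict[tuple[int, int], object]) -> dict:
    """table4: (app ID#, task ID#) -> core ID# or 'N' (not executing).
    Returns Table 5: core ID# -> (app ID#, task ID#)."""
    return {core: app_task for app_task, core in table4.items()
            if core != 'N'}

def segment_start_address(app_id: int, task_id: int) -> int:
    # Application ID as MSBs, task ID as LSBs, scaled by segment size.
    return ((app_id << TASK_ID_BITS) | task_id) * SEGMENT_WORDS

table4 = {(0, 0): 2, (0, 1): 'N', (1, 3): 0}
table5 = invert_task_to_core(table4)
print(table5)                                  # {2: (0, 0), 0: (1, 3)}
print(hex(segment_start_address(*table5[0])))  # 0x4c00
```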

By comparing Tables 4 and 5 above, it is seen that the information contents of Table 4 are the same as those of Table 5; the difference in purpose between them is that, while Table 5 gives for any core 120 its active application task ID# 460 to process, Table 4 gives for any given application task its processing core 420 (if any at a given time). As seen from FIGS. 4-6, the Table 5 outputs are used to configure the core specific multiplexers 620 at XC 470, while the Table 4 outputs are used to configure the application task specific multiplexers 510 at XC 430.

Note further that, according to a particular embodiment of the process 300, when the task to core placement module 330 gets an updated list of selected tasks 325 for one or more applications 220 (following a change in either or both of the core to application allocations 315 or the task priority lists 135 of one or more applications), it will be able to identify from Tables 4 and 5 the following:

-   -   I. The set of activating, to-be-mapped, application tasks, i.e., application tasks within the lists 325 not mapped to any core by the previous run of the placement algorithm 330. This set I can be produced by taking those application tasks from the updated selected task lists 325 whose core ID# was 'N' (indicating task not active) in the latest Table 4;
    -   II. The set of deactivating application tasks, i.e., application tasks that were included in the previous, but not in the latest, selected task lists 325. This set II can be produced by taking those application tasks from the latest Table 4 whose core ID# was not 'N' (indicating task active) but that were not included in the updated selected task lists 325; and
    -   III. The set of available cores, i.e., cores 120 which in the latest Table 5 were assigned to the set of deactivating tasks (set II above).

    The placer module 330, according to such particular embodiment, will use the above info to map the active tasks to the cores of the array in a manner that keeps all the continuing tasks executing on their present cores, thereby maximizing the utilization of the core array 115 for processing the (revenue generating) user applications 220. Specifically, in one such embodiment, the placement algorithm 330 maps the individual tasks 240 within the set I of activating tasks, in their increasing application task ID# order, for processing at the core instances within the set III of available cores, in their increasing core ID# order; a software sketch of producing these sets follows.
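The following is a minimal software sketch of deriving the sets I, II and III above, and the resulting new placements, from the latest Table 4 contents and the updated selected task lists 325; names are illustrative assumptions, not from the source.

```python
# Minimal sketch of the placement deltas described above: activating tasks
# (set I), deactivating tasks (set II), available cores (set III), and the
# ID-ordered mapping of set I onto set III.

def placement_deltas(table4: dict[tuple[int, int], object],
                     selected: set[tuple[int, int]]):
    """table4: (app ID#, task ID#) -> core ID# or 'N' (latest placements);
    selected: updated (app ID#, task ID#) pairs from the lists 325."""
    activating = sorted(t for t in selected if table4.get(t, 'N') == 'N')
    deactivating = {t for t, core in table4.items()
                    if core != 'N' and t not in selected}
    available_cores = sorted(table4[t] for t in deactivating)
    # Continuing tasks keep their cores; activating tasks are mapped, in
    # increasing task ID# order, onto available cores in increasing order.
    new_placements = dict(zip(activating, available_cores))
    return activating, deactivating, new_placements

table4 = {(0, 0): 2, (0, 1): 'N', (0, 2): 5}
selected = {(0, 0), (0, 1)}
print(placement_deltas(table4, selected))
# -> ([(0, 1)], {(0, 2)}, {(0, 1): 5})
```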

In alternative embodiments, the allocation 310 stage of the process 300 can, in addition to determining the number of cores from the array 115 to be allocated to each given application 220, determine also the subsets of specific core instances assigned to the individual applications, and pass that core to application assignment info along to the remaining stages of the process 300, including the task placement 330. In such alternative embodiments, the stage 310 is to keep track of the available core instances that can be reallocated between applications, while the remaining stages of the process (incl. task to core placement) can be done completely independently, e.g. in parallel (incl. concurrently), for each application among the set 210.

Revenue Generation and Cost-Efficiency Improvement Techniques for Embodiments of System 100

Embodiments of the invention involve techniques for maximizing either or both of the following: i) the revenue over a period of time (e.g. a year) for the compute capacity provider operating a given platform 100 (per FIG. 1) of a certain total cost, and ii) the on-time data processing throughput per unit cost for the users of a given platform 100. According to various embodiments, such as the one illustrated in FIG. 8, these techniques have one or more of the following types of objectives:

-   -   1) Maximizing, at given billing rates for core entitlements, the number of core entitlements 317 sold for user contracts supported by a given platform 100. A core entitlement 317 (CE) herein refers to the number of cores 120 up to which amount of cores of the array 115 a given user program 220 is assured to get its core demand figures (CDFs) 130 met by core allocations 315 on successive runs of the algorithm 310.
    -   2) Maximizing, at given billing rates for demand-based core allocations for a billing assessment period (BAP), the total volume of demand-based core allocations for the programs 210 configured for a given platform 100. Herein, a demand based core allocation (DBCA) refers to the amount of cores 120 allocated 315 to a program 220 to meet that program's CDF 130 on the given BAP (i.e., any cores allocated for a program beyond the CDF of the program are not counted as demand based core allocations). In an embodiment, the DBCA for a given program 220 on a given core allocation period (CAP) is taken as the lesser (or either, when equal) of the CDF 130 and the allocated core count 315 of the program.

    These objectives generally reflect the utility for the users running their programs 210 on a platform 100; the users are assumed to perceive value in, and be willing to pay for, assured access to their desired level of data processing capacity of a given compute platform 100 and/or their actual usage of the platform capacity. Accordingly, either or both of the above objectives 1) and 2) are among the principal factors driving the revenue for the operator of the given platform 100.

According to an embodiment of the invention per FIG. 8, the billables (B) 318 for the operator of the platform 100 from a compute capacity service contract with a user are based on the following equation:

    B = x*CE + y*DBCA    (Equation 1)

wherein CE stands for the core entitlement 317 of the user, DBCA stands for the (average) amount of core allocations to that user's program to meet its CDFs over the CAPs during the contract time period in question (e.g., one month), and x and y are billing rates, per the contract terms, that convert CE and DBCA into monetary figures.
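For illustration, the following is a minimal software sketch of the Equation 1 billing computation, generalized to per-phase billing rates as discussed below (e.g. x₁*CE₁ + x₂*CE₂); in the described embodiments this is performed by hardware billing counter logic 316, and the phase names and rate values here are illustrative assumptions.

```python
# Minimal sketch of Equation 1, B = x*CE + y*DBCA, summed over billing
# phases with their own rates. Phase names and rates are illustrative.

def contract_billables(ce_by_phase: dict[str, int],
                       dbca_by_phase: dict[str, float],
                       x_rates: dict[str, float],
                       y_rates: dict[str, float]) -> float:
    """Sum over phases of (x*CE + y*DBCA), per Equation 1 with
    time-phased rates (x1*CE1 + x2*CE2 + ...)."""
    b = 0.0
    for phase in x_rates:
        b += x_rates[phase] * ce_by_phase.get(phase, 0)
        b += y_rates[phase] * dbca_by_phase.get(phase, 0.0)
    return b

billables = contract_billables(
    ce_by_phase={"business": 8, "evening": 1, "night": 0},
    dbca_by_phase={"business": 5.5, "evening": 0.5, "night": 0.0},
    x_rates={"business": 0.08, "evening": 0.04, "night": 0.01},
    y_rates={"business": 0.04, "evening": 0.02, "night": 0.01},
)
print(round(billables, 4))  # 0.91 (monetary units for the assessed period)
```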

Note that one advantage of this billing method is that a portion (i.e., the term y*DBCA; in FIG. 8, element 860) of the cost of the utility computing service for a user running its program 220 on the platform 100 is based on the CDFs 130 of the user's program (to the degree that the CDFs are met by core allocations 315). Therefore, each user of the compute capacity service provided with a platform 100 has an economic incentive to configure its programs 220 so that they eliminate any CDFs beyond the number of cores that the given program 220 is actually able to utilize at the given time. The user applications 210 thus have an incentive not to automatically demand 130 at least their CE worth of cores irrespective of how many cores the given program is able to execute on in parallel at any given time. This incentive leads to increasing the average amount of surplus cores for runs of the core allocation algorithm 310, i.e. cores that can be allocated in a fully demand driven manner (rather than merely to meet the CDFs 130 by each application up to at least their CE figure worth of cores). Such maximally demand driven core allocation (which nevertheless allows guaranteeing each user application an assured, deterministic minimum system capacity access level whenever actually demanded) facilitates providing maximized data processing throughput per unit cost across the set 210 of user applications dynamically sharing the platform 100.

Moreover, in certain embodiments, either or both of the billing rates x (element 810 in FIG. 8) and y (element 840 in FIG. 8) for Equation 1 can be specified in the contract terms to vary over time. In a particular embodiment, the term x*CE (element 830 in FIG. 8) takes the form of a sum such as x₁*CE₁ + x₂*CE₂, wherein x₁ is the billing rate for a core entitlement during specified premium business hours (e.g. Monday-Friday 9 am-5 pm in the local time zone of the given platform or user) and x₂ the billing rate for a core entitlement outside the premium business hours, while CE₁ and CE₂ are the core entitlements for the given contract for the premium and non-premium hours, respectively. Naturally, in various embodiments, there can be more than just two time phases with their respective billing rates. For instance, in addition to premium pricing during the business hours, the evening hours 5 pm-1 am could also have a different billing rate than 1 am-9 am, and so forth, depending on the popularity of compute capacity usage during any given hours of the day. Similarly, different days of the week, special calendar days etc. can have different billing rates, e.g. based on the expected popularity of compute capacity on such days. Naturally, this discussion applies also to the coefficient y of the term y*DBCA (element 860 in FIG. 8) in Equation 1.

According to an embodiment of the invention per FIG. 8 (see also context from FIGS. 3 and 1), digital hardware logic within the controller module 140 functions as a billing counter 316 for the contracts supported by a given platform 100. Such billing counter logic uses the CDFs 130 and the core allocation figures 315 per each user program 220 to keep track (in FIG. 8 at submodule 850) of the series of DBCA figures for each program 220 for successive capacity allocation period instances (demarcated by notifications of CAP boundaries 314, e.g. after completion of core allocations by module 310), and accordingly the logic of module 316 multiplies (in FIG. 8, also at submodule 850) such series of contract specific DBCAs with the contract-specified billing rates y 840 (applicable at the time of any given DBCA occurrence) to form the billing components 860 (in Equation 1) attributable to the demand based core usage by the users of the programs. Certain embodiments of such hardware billing counters 316 further similarly compute (in FIG. 8 at submodule 820) the core entitlement based billing components 830 and add them (in FIG. 8 at submodule 870) to the former (i.e. the information flow 860 in FIG. 8) to form the billables 318 for the user contracts for the given contract period. Note that, for simplicity and clarity, FIG. 8 presents the billing counter hardware logic modules as serving a single user application 220. In certain embodiments, at least some of these logic resources can be time shared to serve multiple, up to all, user programs 210 on a given platform 100, while in alternative embodiments there can be a dedicated billing counter instance 316 per each user program configured for a given platform 100.

In an alternative logic implementation of the billing subsystem functionality discussed herein, in addition to the billing rate values, the signals 810, 840 provide notifications of the transitions of the contract time phases at which the CE and DBCA billing rates get new values. In such a logic implementation, the DBCA based billing counter 850 counts an average number of cores allocated to a given user program 220 over the core allocation periods (CAPs) during a given billing assessment period (BAP) for which the DBCA billing rate remained constant, and multiplies this average DBCA amount with the total DBCA billing rate per core applicable for that BAP. Similarly, according to this logic implementation principle, the CE based billing counter 820 counts the average CE level of the given program (or simply takes any constant CE level for the time phase in question) for a given BAP for which the CE billing rate remains constant, and multiplies that average (or simply constant) CE level with the total CE billing rate applicable for that BAP. In such a logic implementation, the adder 870 accumulates the series of billable components 860, 830 so produced for such BAPs of constant billing rates to form the billables 318 for the given program. For context, note that in the envisioned computing service contract scenarios with platforms 100, the typical CAPs are expected to consist of tens to thousands of processing logic clock cycles, thus lasting for microseconds or less, while the BAPs, at the boundaries of which the billing rates 810, 840 change, may last from minutes to hours, comprising several millions to billions of CAPs. Finally, the contract invoicing periods may be calendar months, thus typically comprising tens to hundreds of BAPs.
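The following is a minimal software sketch of this alternative billing counter behavior, accumulating an average DBCA per BAP of constant rates; the class and method names are illustrative assumptions, not from the source.

```python
# Minimal sketch of the alternative billing-counter behavior above: per
# BAP of constant rates, average the DBCA over the CAPs, multiply by that
# BAP's rates, and accumulate into the running billables.

class BillingCounter:
    def __init__(self):
        self.billables = 0.0   # accumulator 870 / output 318
        self._dbca_sum = 0.0
        self._caps = 0

    def on_cap_boundary(self, cdf: int, allocated: int) -> None:
        """At each CAP boundary 314, record the demand based core
        allocation: the lesser of the CDF and the allocated core count."""
        self._dbca_sum += min(cdf, allocated)
        self._caps += 1

    def on_bap_boundary(self, ce_level: int, x_rate: float,
                        y_rate: float) -> None:
        """At a billing-rate transition, close out the BAP: average DBCA
        times the DBCA rate, plus the CE level times the CE rate."""
        avg_dbca = self._dbca_sum / self._caps if self._caps else 0.0
        self.billables += y_rate * avg_dbca + x_rate * ce_level
        self._dbca_sum, self._caps = 0.0, 0

counter = BillingCounter()
for cdf, alloc in [(3, 4), (5, 2), (0, 1)]:   # three CAPs in one BAP
    counter.on_cap_boundary(cdf, alloc)
counter.on_bap_boundary(ce_level=2, x_rate=0.04, y_rate=0.02)
print(round(counter.billables, 4))  # 0.1133
```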

Furthermore, a compute capacity provider operating a platform 100 can offer different types of CE time profiles for different application 220 types. For instance, a service provider operating a platform 100 could sell four basic contract types with differing CE time profiles, per the examples of contract plans A, B, C and D in Table 6 below:

TABLE 6

| Plan | A | B | C | D | Sum of CEs = cores needed for the below contract mix |
|------|---|---|---|---|------|
| Contract type | enterprise | entertainment | batch | always-on | |
| Number of contracts | 1 | 3 | 1 | 2 | |
| CEs, time profiled: business hours | 8 | 2 | 0 | 1 | 16 |
| CEs, time profiled: evening hours | 1 | 4 | 0 | 1 | 15 |
| CEs, time profiled: night hours | 0 | 2 | 8 | 1 | 16 |
| Max during 24 h | | | | | 16 |
| CEs, flat: any hour | 8 | 4 | 8 | 1 | 30 |
| Efficiency gain of time profiled CEs vs. flat CEs | | | | | 87.5% |

As illustrated in Table 6, the capability per the invention to allow configuring compute capacity contracts with differing CE time profiles, particularly contract types with non-overlapping CE peaks, on a given platform 100 can facilitate both improving the computing cost-efficiency for the users of the compute service provided through the platform and increasing the revenues that the compute capacity service provider is able to achieve with a platform of a certain cost of ownership. In embodiments of the invention, either or both of the CE and DBCA billing rates can be set to different values for the different billing assessment periods (BAPs) of the day, week, month, etc., in order to optimally spread the user program processing load for a given platform 100 over time, and thereby maximize the cost efficiency for the users of the computing service provided with the given platform and/or the revenue generation rate for the service provider operating the platform. For instance, in an example scenario, the CE billing rate on business days could be $0.08 per core for the BAP of the business hours, $0.04 for the BAP of the evening hours, and $0.01 for the BAP of the night hours, while the DBCA billing rate, per the average number of demand based cores allocated to a given program over the eight hours of these daily BAPs, could be $0.04 for the business, $0.02 for the evening, and $0.01 for the night BAPs. In various other scenarios, these daily BAP billing rates can be set to any other values, can have differing values on different calendar days, and different week days (e.g. Monday-Friday versus Saturday-Sunday) can have non-uniform BAP phasing (e.g. Saturday-Sunday could replace the business hour BAP of Monday-Friday with an 'extended' evening hour BAP), etc.

With the example values of Table 6 for a mix (or 'basket' 210) of enterprise, entertainment (including news etc.), batch job (overnight block data processing), and always-on types of applications 220, it can be seen that the capability per the invention to configure the applications of a given system 100 with different CE time profiles can allow the service provider operating the given system 100 to support a given set 210 of applications, with their collective CE requirements, with a significantly reduced system core 120 count, i.e., with a lower cost base for the revenues generated by supporting the given set of user applications 210. With the numerical example shown in Table 6, this system core utilization efficiency gain with time-profiled contract CEs, compared to flat CEs, enables a reduction from 30 to 16 cores needed for the provided mix of user contracts. In turn, this compute resource utilization efficiency gain with time profiled CEs reduces the cost of revenue for the utility computing service provider by an accordant factor. Put differently, the service provider's revenue per unit cost of the service provided (driven by the number of cores needed to support a given set 210 of contracts) is multiplied accordingly.

Note that in the discussion herein regarding the example of Table 6, also the flat CE reference that the time profiled CE contracts are compared with is assumed to be implemented on a platform 100 that supports the application load adaptive core allocation as described herein in reference to FIGS. 1-7. It should be noted that the capability of the described embodiments of the invention to support such dynamic compute resource allocation with contract specified minimum system access level guarantees (when so demanded) is not supported by conventional computing systems, and as such, the contracts supported with a system 100, i.e. contracts with the capability to burst up to the full system core capacity while having a minimum assured level of access to the shared multi-core system capacity, are expected to have a higher market value than conventional types of contracts with either only a dedicated share of a given compute system's capacity (but without a capability to burst beyond the dedicated cores) or a capability to burst (but without a minimum core count based access level that the user contract would be guaranteed to get whenever needed). Moreover, regarding Table 6, please also note that a CE level of zero (0) does not, in the herein discussed embodiments, mean that the given contract type would not allow the application under that contract to execute on its host platform 100 during the hours in question; instead, a CE of 0 indicates that, while the application is not guaranteed to have its CDFs met up to any specified minimum core count, it will still in practice get its demand based fair share of the cores allocated to it after the CDFs of the set of applications 210 have been met up to their CE levels (as described under "Module-Level Implementation Specifications for the Application Task to Core Placement Process" in the foregoing). In fact, at times when there is no other user application expressing a positive CDF at a given system 100, the application with a CE of 0 can get its CDFs met all the way up to the maximum core count of the array 115.

It shall also be understood that the 24 hour cycle for the CE time profiles per the example of Table 6 is here merely to illustrate the capability per the invention to facilitate efficient combining of applications 220 with differing time-variable demand profiles for compute capacity into a shared compute capacity pool 110. In various implementation scenarios of the invention, there can be, for instance, further variants of plans within the basic contract types (e.g. plans A through D per Table 6) that offer greater CE levels than the norm for the given base plan (e.g. plan A) at specified seasons or calendar dates of the year (either during the peak hours of the profile or throughout given 24 hour days), in exchange for lower CE levels than the norm for that base plan at other dates or seasons. Besides combining contracts with differing CE profiles within 24 h cycles, as illustrated in Table 6, to dynamically share the same capacity pools 115, the invention also facilitates combining the seasonally differing variants of contracts within a given plan type (i.e. variants with non-coinciding seasonal peaks in their CE profiles) in the same capacity pools, for further compute capacity utilization efficiency gains beyond the 8-hour phases shown in the simplistic example of Table 6. Moreover, there can be variants of contract types within a given base plan that have a finer time granularity in their CE profiles. For instance, among the contracts of type B, there can be a variant that offers a greater than the standard CE level of the plan type for the night hours (e.g. 1 am-9 am) at specific timeslots (e.g. for news casts, for 15 minutes at 6 am, 7 am and 8 am), in exchange for a lower CE at other times during the night hours. Similarly, the invention facilitates efficiently combining these types of variants of contracts within a given type, with complementary peaks and valleys in their CE profiles, also within a given phase of the 24 h cycle (e.g. the night hour phase). In particular embodiments, this type of combining of complementary variants (whether seasonally, within 24 h cycles, etc.) of a given contract type takes place within the aggregate CE subpool of the contracts of the given base type. In the example shown in Table 6, this type of intra contract type combining of complementary variants can thus take place among the three contracts of type B, whose aggregate CE level during the night hours is worth 3*2=6 cores for each CAP. Note that in embodiments of the invention with a greater number of cores, there will normally be a greater number of applications of any given type sharing the system (and a greater subpool of CEs for each contract type) than what is shown in the intentionally simple, illustrative example of Table 6. Note also that the hardware logic based implementation of the user application billing counters 316 per FIG. 8, including the hardware logic based subcounter 820 for computing the CE based billables components for each given application for any given CAP, allows such embodiments of the invention to support, in practical terms, infinitely fine granularity of CE time profiling for the contract types and their variants.
Moreover, the capability to customize the contract and variant CE time profiles per their application specific demands for data processing capacity, with the hardware logic based fine granularity, determinism, accuracy and efficiency, enables the computing service provider operating a platform 100 to profitably sell highly competitively priced compute capacity service contracts, with the offered customizable CE time profiles accurately matching the processing capacity demands of any given application type. Similarly, the hardware logic based billing counters per FIG. 8 support, in practical terms, infinitely fine time granularity for CE pricing, through varying the CE billing rate 810 values over time, even at the granularity of individual CAPs if desired. With this capability of the invention, the users with less time sensitive programs 220, for instance among the programs of a given type within their base plan, have an incentive to shift their processing loads (at least in terms of their core entitlements) to less busy times, to make room for CE peaks at popular times (e.g. for news casts on the hour during 6-9 am) for the applications that can afford to pay for the pricier CEs at such times of high demand for CEs (as compared with flat pricing). These low-overhead, accurate hardware logic based pricing adjustment, fine granularity billing and efficient compute platform sharing techniques per the discussed embodiments of the invention facilitate both maximizing the users' net value of the compute service being subscribed to and maximizing the service provider's profitability.

Benefits

According to the foregoing, advantages of the contract pricing based system 100 capacity utilization and application 220 performance optimization techniques include:

-   -   Increased user utility, measured as demanded-and-allocated cores per unit cost, as well as, at least in certain cases, allocated cores per unit cost. Note that, compared to a case where the users would purely pay for their CEs, and as such would have no direct incentive to ever demand 130 less than their CE 317 worth of cores, the billing method per the herein discussed embodiments of the invention, wherein a portion of the billables per user is based on the user's DBCAs 860 during the billing assessment period, incentivizes the users to economize on their CDFs 130 (e.g. to not demand their CE worth of cores unless the given user application is able to effectively utilize such a number of cores at the time), which, in turn, leads to there being, on average, more cores per cost unit of a system 100 to be allocated to meet CDFs above any given user's CE, whenever the given user's program is actually able to benefit from such bursting. Note also that cores allocated beyond the CDF of the user's application do not cost the user anything, and, at least in some situations, a user's program 220 may be able to gain a performance benefit from receiving a greater number of cores allocated 315 to it than the number of cores it demanded. Thus the described embodiments of the invention increase the amount of utilizable parallel execution core capacity received by any given user application on a platform 100 per unit of cost of the computing service provided through the platform.
    -   Increased revenue generating capability for the service provider from CE based billables, per unit cost of a system 100, through the ability to offer contract plans with mostly or fully non-overlapping CE peaks (such as in the case of plans A through D per the example of Table 6). This enables increasing the service provider's operating cash flows with a system 100 of a certain cost level. Also, compared to a given computing service provider's revenue level, this method reduces the provider's cost of revenue, allowing the provider to offer more competitive contract pricing by passing on at least a portion of the savings to the customers (also referred to as users) running programs 220 on the system 100, thereby further increasing the customers' utility of the computing service subscribed to (in terms of compute capacity received when needed, specifically, the number of cores allocated and utilized for parallel program execution) per unit cost of the service. Consequently, this technique of optimally combining user contracts with complementary CE time profiles on a given platform 100 allows the service provider operating the platform 100 to increase the competitiveness of its compute capacity service offering among prospective customers, in terms of both performance and price.

At a more technical level, the invention allows efficiently sharing a multi-core based computing hardware among a number of application software programs, each executing on a time variable number of cores, maximizing the whole system data processing throughput, while providing deterministic minimum system processing capacity access levels for each one of the applications configured to run on the given system.

Moreover, the fabric network 400 (described in relation to FIGS. 4-7) enables running any application task on the system at any of its cores at any given time, in a restriction free manner, with minimized overhead, including minimized core idle times, and without a need for a collective operating system software during the system runtime operation (i.e., after its startup or maintenance configuration periods) to handle matters such as monitoring, prioritizing, scheduling, placing and policing the user applications and their tasks. According to the described embodiments of the invention, the fabric network achieves this optimally flexible use of the cores of the system in a manner efficient in both software and hardware implementation (including logic and wiring resource efficiency), as well as memory efficiently, without a need for either application to application, task to task, or core to core level cross-connectivity, and without a need for the cores to hold more than one task's image within their memories at a time. Instead of needing application task to task or core to core cross-connects for inter-task communications and/or memory image transfers, the invention achieves their purposes more efficiently (in terms of the system resources needed) through a set of multiplexers connecting the cores to the application task specific segments at the fabric memory. The invention thereby enables the application tasks running on any core of the fabric to communicate with any other task of the given application without requiring any such communicating task to know whether and where (at which core) the other tasks are running at any given time. The invention thus provides architecturally improved scalability for parallel data processing systems as the number of cores, applications and tasks within applications grows. To summarize, the invention enables each application program to dynamically get a maximized number of cores that it can utilize in parallel, so long as such demand-driven core allocation allows all applications on the system to get at least up to their entitled number of cores whenever their processing load actually so demands.

The invented data processing systems and methods thus enable dynamically optimizing the allocation of parallel processing capacity among a number of concurrently running application software programs, in a manner that is adaptive to the realtime processing loads offered by the applications, and with minimized system (hardware and software) overhead costs. Furthermore, the system per FIGS. 1-7 and the related descriptions, in particular when combined with the pricing optimization and billing techniques per FIG. 8 and the related descriptions, enables maximizing the overall utility computing cost-efficiency. Accordingly, benefits of the invented, application load adaptive, minimized overhead multi-user data processing system include:

-   -   Practically all the application processing time of all the cores
        across the system is made available to the user applications,
        as there is no need for common system software to run on the
        system (e.g. to perform, on the cores, traditional operating
        system tasks such as time tick processing, serving interrupts,
        and scheduling and placing applications and their tasks onto
        the cores).
    -   The application programs do not experience considerable delays
        in waiting for access to their (e.g. contract-based) entitled
        share of the system processing capacity, as any number of the
        processing applications configured for the system can run on
        the system concurrently, with a dynamically optimized number of
        parallel cores allocated per application.
    -   The allocation of the processing time across all the cores of
        the system among the application programs sharing the system is
        adaptive to the realtime processing loads of these
        applications.
    -   There is inherent security (including, where desired,
        isolation) between the individual processing applications in
        the system, as each application resides in its dedicated
        (logical) segment of the system memory, and can safely use the
        shared processing system effectively as if it were the sole
        application running on it. This hardware based security among
        the application programs and tasks sharing a multi-core data
        processing system per the invention further facilitates more
        straightforward, cost-efficient and faster development and
        testing of the applications and tasks to run on such systems,
        as undesired interactions between the different user
        application programs can be disabled already at the system
        hardware resource access level.

The invention thus enables maximizing the data processing throughput per unit cost across all the processing applications configured to run on the shared multi-core computing system.

The hardware based scheduling and context switching of the invented system accordingly ensure that any given application gets at least its entitled share of the shared parallel processing system capacity whenever the given processing application is actually able to utilize at least its entitled quota of system capacity, and gets as much processing capacity beyond its entitled quota as is possible without blocking access to the entitled and fair share of the processing capacity by any other application program that is actually able, at that time, to utilize the capacity it is entitled to. For instance, the invention thus enables any given user application to get access to the full processing capacity of the multi-core system whenever the given application is the sole application offering processing load for the shared multi-core system. In effect, the invention provides each user application assured access to its contract based percentage (e.g. 10%) of the multi-core system throughput capacity, plus, much of the time, a far greater share, up to 100%, of the processing system throughput capacity, with the cost base for any given user application being largely defined by only its committed access percentage worth of the shared multi-core processing system costs.
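For illustration of the scheduling behavior just described (and of the three allocation rounds recited in claims 26 and 30 below), the following is a minimal behavioral sketch in software. The function and variable names are assumptions of this sketch; actual embodiments implement the policy in hardware logic.

```python
# Behavioral sketch of the allocation policy: entitled shares first,
# then unmet demand, then leftover capacity. Not a hardware design.

def allocate_cores(total_cores: int,
                   demands: dict[str, int],
                   entitlements: dict[str, int]) -> dict[str, int]:
    alloc = {p: 0 for p in demands}
    free = total_cores

    # Round 1: meet each program's materialized demand up to its entitlement.
    for p in demands:
        take = min(demands[p], entitlements.get(p, 0), free)
        alloc[p] += take
        free -= take

    # Round 2: hand remaining cores to programs whose demand is still unmet.
    for p in demands:
        if free == 0:
            break
        take = min(demands[p] - alloc[p], free)
        alloc[p] += take
        free -= take

    # Round 3: distribute any cores still left across the programs
    # (e.g. round-robin), so no capacity sits idle.
    programs = list(demands)
    i = 0
    while free > 0 and programs:
        alloc[programs[i % len(programs)]] += 1
        free -= 1
        i += 1
    return alloc
```

For example, with 16 cores, demands {A: 4, B: 10, C: 0} and entitlements {A: 4, B: 4, C: 8}: round 1 yields A=4, B=4; round 2 gives B its remaining 6; round 3 spreads the final 2 cores round-robin (here to A and B). Any program that demands at least its entitlement is thus never blocked below it, while idle entitled capacity becomes burst capacity for the others.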

The references [1], [2], [3], [4], [5] and [6] provide further reference specifications and use cases for aspects and embodiments of the invented techniques.

CONCLUSIONS

This description and the drawings are included to illustrate the architecture and operation of practical embodiments of the invention, but are not meant to limit the scope of the invention. For instance, even though the description does specify certain system parameters to certain types and values, persons of skill in the art will realize, in view of this description, that any design utilizing the architectural or operational principles of the disclosed systems and methods, with any set of practical types and values for the system parameters, is within the scope of the invention. For instance, in view of this description, persons of skill in the art will understand that the disclosed architecture sets no actual limit on the number of cores in a given system, or on the maximum number of applications or tasks to execute concurrently. Moreover, the system elements and process steps, though shown as distinct to clarify the illustration and the description, can in various embodiments be merged or combined with other elements, or further subdivided and rearranged, without departing from the spirit and scope of the invention. It will also be obvious that the systems and methods disclosed herein can be implemented using various combinations of software and hardware. Finally, persons of skill in the art will realize that various embodiments of the invention can use different nomenclature and terminology to describe the system elements, process phases and other such technical concepts in their respective implementations. Generally, from this description many variants will be understood by one skilled in the art that are yet encompassed by the spirit and scope of the invention.

1-25. (canceled)
26. A system for computing resource management, the system comprising: a hardware logic subsystem configured to periodically, once for each successive core allocation period (CAP), execute an algorithm allocating an array of processing cores among a set of software programs, said subsystem comprising: (i) a piece of logic configured to carry out a first round of the algorithm, by which round a subset of the cores are allocated among the programs so that any actually materialized demands for the cores by each of the programs, up to their respective entitled shares of the cores, are met; (ii) a piece of logic configured to carry out a second round of the algorithm, by which round any of the cores that remain unallocated after the first round are allocated among the programs whose materialized demands for the cores had not been met by the amounts of the cores so far allocated to them by the present invocation of the algorithm; and (iii) a piece of logic configured to carry out a third round of the algorithm, by which round any of the cores that remain unallocated after the second round are allocated among the programs, wherein the materialized demand for the cores by a given one of the programs is expressed as a number of schedulable tasks that the given program has ready for execution for a CAP following a present invocation of the algorithm.
27. The system of claim 26, further comprising: a hardware logic subsystem configured to assign individual programs of the set to individual cores of the array in a manner that assigns each such instance of the programs, which was selected for execution on the array of cores on consecutive CAPs, to the same one of the cores for execution on each of such consecutive CAPs.

28. The system of claim 26, wherein the number of schedulable tasks that the given program has ready for execution for the CAP following the present invocation of the algorithm is formed independently of (1) the respective numbers for other programs of the set, (2) the other programs' utilizations of any cores allocated to them, and (3) utilization of the cores across the array.
29. The system of claim 26, wherein, on at least some invocations of the algorithm, the subset of the cores allocated by the first round comprises zero cores, whereas, on at least some of the other invocations of the algorithm, the subset of the cores allocated by the first round comprises at least one, and up to all, of the cores.
30. A method for allocating an array of processing cores among a set of software programs for successive core allocation periods (CAPs), the method comprising steps of: (i) initially, allocating a subset of the cores among the programs so that any actually materialized demands for the cores by each of the programs, up to their respective entitled shares of the cores, are met; (ii) following step (i), allocating any of the cores that remain unallocated among the programs whose materialized demands for the cores had not been met by the amounts of the cores so far allocated to them by the present exercising of the method; and (iii) following step (ii), allocating any of the cores that remain unallocated among the programs, wherein the materialized demand for the cores by a given one of the programs corresponds to a number of schedulable tasks that the given program has ready for execution for the CAP following a present exercising of the method.
31. The method of claim 30, wherein the number of schedulable tasks that the given program has ready for execution for the CAP following the present exercising of the method is formed (1) independently of the respective numbers for other programs of the set, (2) irrespective of the other programs' utilizations of any cores allocated to them, and (3) so that said number, for at least some of the CAPs, exceeds the number of the cores allocated to the given program for a CAP preceding the present exercising of the method.
32. The method of claim 30, wherein, for such occasions of exercising the method when there is no materialized demand for the cores by a given one of the programs, the subset of the cores allocated by the first round for that given program comprises zero cores.
33. The method of claim 30, wherein step (iii) allocates the remaining unallocated cores so that any programs with no existing allocated cores are prioritized in getting cores allocated.
34. The method of claim 30, wherein at least one of steps (ii) and (iii) is exercised by iterating through the programs while starting with a revolving program within said set on successive executions of the method.
35. The method of claim 30, wherein one of the schedulable tasks of the given program is: a task, a process, a thread or a function of that given program.
36. An application program load adaptive data processing system comprising: an array of processing cores for processing instructions and data of a set of software programs configured to share the system; a placer for repeatedly, once for each successive Core Allocation Period (CAP), assigning individual cores of the array to individual programs among said set; and a processing core ID indexed digital hardware logic look-up-table for storing program to processing core assignment information, wherein the assigning by the placer (a) is done at least in part based on capacity demand indicators by at least some among the set of programs, with such an indicator by a given program expressing a number of cores of the array that the given program is demanding for a succeeding CAP, and (b) results in storing, in the processing core ID indexed digital hardware logic look-up-table, identifiers indicating which program among the set a given core among the array was assigned to.
37. The system of claim 36, wherein the number of cores of the array that the given program is demanding for the succeeding CAP: (i) is formed (1) independently of the respective numbers for other programs of the set, and (2) irrespective of the other programs' utilizations of any cores allocated to them; (ii) for at least some of the CAPs, exceeds a number of the cores assigned to the program prior to said succeeding CAP; and (iii) equals a number of cores that the program is able to execute on in parallel on said succeeding CAP.

38. The system of claim 36, wherein the number of cores of the array that the given program is demanding for the succeeding CAP corresponds to a number of schedulable tasks that the given program has ready for execution for that CAP, and wherein, for at least some of the CAPs, said number of schedulable tasks exceeds a number of the cores assigned to the program on the CAP preceding said succeeding CAP.
39. The system of claim 36, wherein the placer comprises logic that, after receiving a new allocation of the cores among the programs to replace a present allocation of the core slots among the programs, maps instances of the programs to the array of cores through logic subsystems configured to: (i) identify the following: a) a set of instances of the programs from the new allocation that were not included in the present allocation, with this set referred to as activating program instances; b) a set of instances of the programs from the present allocation that are not in the new allocation, with this set referred to as deactivating program instances; and c) a set of cores among the array that were assigned to the set of deactivating program instances in the present allocation, with this set referred to as available cores; and (ii) assign the array of cores among the instances of the programs by placing each of the activating program instances to one of the available cores, while keeping each given such program instance, which was included both in the present and the new allocation, assigned for the CAP corresponding to the new allocation to the same core as the given program instance was assigned on the CAP corresponding to the present allocation.
40. The system of claim 36, wherein at least one of the capacity demand indicators comprises a software variable mapped to a hardware device register accessible by the placer.
41. The system of claim 36, wherein: the set of programs are identifiable by program ID numbers from 0 through a total count of the programs configured to share the system less one; and the assigning by the placer involves storing core allocation information in a program ID indexed digital look-up-table (LUT) within hardware logic of the system, so that at least one given program ID indexed element of the LUT stores a number expressing how many cores of the array are being allocated to a program associated with that given program ID indexed element of the LUT.
42. The system of claim 36, further including logic configured to control, at least in part based on the assigning by the placer, which program among the set will execute on which core among the array.
43. The system of claim 36, wherein: the cores of the array are identifiable by core ID numbers from 0 through a total count of the cores of the array less one; and in the core ID indexed look-up-table (LUT) at least one given core ID indexed element of the LUT stores an identifier of a program assigned to execute on a core associated with that given core ID indexed element of the LUT.
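For illustration only (not part of the claims), the following software stand-in models the two look-up-tables of claims 41 and 43: a program ID indexed LUT holding allocation counts, and a core ID indexed LUT holding assigned program IDs. Real embodiments are digital hardware logic; the list-based representation and names here are assumptions of the sketch.

```python
# Illustrative software stand-in for the hardware LUTs of claims 41 and 43.

NUM_CORES = 8
PROGRAMS = {0: 3, 1: 1, 2: 4}   # program ID -> cores allocated this CAP

# Program ID indexed LUT (claim 41): entry p holds how many cores of the
# array are allocated to the program with ID p.
alloc_lut = [PROGRAMS.get(p, 0) for p in range(len(PROGRAMS))]

# Core ID indexed LUT (claim 43): entry c holds the ID of the program
# assigned to execute on core c; filled here by walking the allocation
# counts in program ID order.
core_lut: list[int] = []
for prog_id, count in enumerate(alloc_lut):
    core_lut.extend([prog_id] * count)

assert len(core_lut) == NUM_CORES  # every core maps to exactly one program
```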
44. A method for mapping, by a placer implemented in digital hardware logic, a set of software programs to execute on an array of processing cores of a shared data processing hardware, the method comprising a repeatedly exercised series of steps as follows: monitoring capacity demand indicators of one or more programs among the set of programs, with said indicator of a given program expressing a number of tasks that the given program has ready for execution for a succeeding Core Allocation Period (CAP); and allocating the array of cores among the set of programs for the succeeding CAP at least in part based on said capacity demand indicators for that CAP; wherein the step of allocating for said succeeding CAP leads to storing, in a processing core ID indexed digital hardware logic look-up-table (LUT), identifiers indicating which program among the set a given core among the array was assigned to for that CAP.
45. The method of claim 44, further comprising: after the step of allocating has produced a new allocation of the cores among the programs to replace a present allocation of the core slots among the programs, placing instances of the programs to the array of cores through sub-steps of: (i) identifying the following: a) a set of instances of the programs from the new allocation that were not included in the present allocation, with this set referred to as activating program instances; b) a set of instances of the programs from the present allocation that are not in the new allocation, with this set referred to as deactivating program instances; and c) a set of cores among the array that were assigned to the set of deactivating program instances in the present allocation, with this set referred to as available cores; and (ii) assigning the array of cores among the instances of the programs by placing each of the activating program instances to one of the available cores, while keeping each such program instance, which was included both in the present and the new allocation, assigned for the CAP corresponding to the new allocation to the same core as it was assigned on the CAP corresponding to the present allocation.
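For illustration only (not part of the claims), the placement sub-steps of claim 45 (and of claim 39) can be sketched behaviorally as follows: continuing program instances keep their cores, and activating instances take the cores freed by deactivating ones. The data shapes and names are assumptions of this sketch.

```python
# Behavioral sketch of the placement sub-steps of claims 39 and 45.

def place(present: dict[str, int], new_instances: set[str]) -> dict[str, int]:
    """present: program instance -> core ID for the current CAP;
    new_instances: instances selected to run on the next CAP."""
    # (i.a) activating: in the new allocation but not the present one.
    activating = new_instances - present.keys()
    # (i.b) deactivating: in the present allocation but not the new one.
    deactivating = present.keys() - new_instances
    # (i.c) available cores: those held by the deactivating instances.
    available = [present[inst] for inst in sorted(deactivating)]
    # (ii) continuing instances stay on their present cores, minimizing
    # context transfers; activating instances fill the freed cores.
    placement = {inst: core for inst, core in present.items()
                 if inst in new_instances}
    for inst, core in zip(sorted(activating), available):
        placement[inst] = core
    return placement
```

Keeping continuing instances pinned to their present cores is what avoids needless memory image transfers between consecutive CAPs.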
46. The method of claim 44, wherein the number of tasks that the given program has ready for execution for the succeeding CAP is formed: (1) independently of the respective numbers for other programs of the set, and (2) irrespective of the other programs' utilizations of any cores allocated to them.
47. The method of claim 44, wherein the number of tasks that the given program has ready for execution for the succeeding CAP equals a number of cores that the program is able to execute on in parallel on said succeeding CAP, and, for at least some of the CAPs, exceeds a number of the cores assigned to the program prior to said succeeding CAP.
48. The method of claim 44, wherein the step of allocating ensures that any given program gets at least its entitled share of the processing cores following such runs of the method for which it demanded at least such entitled share, wherein the entitled share of the processing cores for a given program is one of: i) an even division of the cores within the array of cores, or ii) a contract based amount of cores.
49. The method of claim 44, wherein the program identifiers stored in successive addresses of the LUT direct which of the programs will run on which of the cores on said succeeding CAP.
50. The method of claim 44, wherein, as a result of the allocating step, a representation of an allocation of the array of cores among the set of programs is stored in a program ID addressed digital hardware logic LUT, with entries at successive addresses of the program ID addressed LUT expressing a quantity of the processing cores being allocated to a program corresponding to a given address of that LUT.